Overview of IBM Cloud data sources

You can use IBM Watson® Discovery on the IBM Cloud® to connect to and crawl documents from remote sources.

IBM Cloud only

This information applies only to managed deployments. For more information about IBM Cloud Pak for Data data sources, see Overview of Cloud Pak for Data data sources.

Connect to an external data source so that you can pull documents into Discovery on a schedule. Discovery retrieves the documents by crawling the data source. Crawling is the process of systematically browsing and retrieving documents from a starting location that you specify. The first time that the crawler processes a data source, it performs a full crawl. On each subsequent run, it performs a refresh, in which it checks for new and changed files only.
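The difference between a full crawl and a refresh can be illustrated with a minimal Python sketch. This is not how the Discovery crawler is implemented; the `CrawlState` class, the `crawl` function, and the use of last-modified timestamps to detect changes are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlState:
    """Hypothetical store of the last-modified value seen per document path."""
    seen: dict[str, float] = field(default_factory=dict)


def crawl(source: dict[str, float], state: CrawlState) -> list[str]:
    """Return the document paths to retrieve on this run.

    `source` maps a document path to its last-modified timestamp.
    With an empty state, every document is returned (a full crawl).
    On later runs, only new or changed documents are returned (a refresh).
    """
    to_fetch = [
        path
        for path, modified in sorted(source.items())
        if state.seen.get(path) != modified
    ]
    # Remember what was seen so that the next run can detect changes.
    state.seen.update(source)
    return to_fetch
```

In this sketch, calling `crawl` twice on an unchanged source returns every path the first time and an empty list the second time. Deleted documents are not handled here.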

All Discovery data source connectors are read-only. Regardless of the permissions that are granted to the crawl account, Discovery never writes, updates, or deletes any content in the original data source.

You can use Discovery to crawl from the following data sources:

Your data source isn't listed? Check whether IBM® App Connect has a connector to the data source. You can use a default connector that is built for App Connect to send data from a data source to Discovery. For a list of the data sources supported by App Connect default connectors, see Connectors A-Z. For more information about integrating App Connect with Discovery, see How to use IBM App Connect with IBM Watson® Discovery.

To use an App Connect connector, you must create a separate App Connect instance. Costs that are incurred from a paid App Connect instance are not included with the cost of using Discovery. Except for indexing, Discovery does not support any integration with App Connect that you perform on your own.

Data source requirements

The following requirements and limitations are specific to Discovery on IBM Cloud:

  • A collection can connect to only one data source.

  • For more information about size limits, which can differ per plan, see the following topics:

Installing IBM Secure Gateway for on-premises data

IBM® Secure Gateway for IBM Cloud® is being deprecated. Existing clients who use Secure Gateway can get guidance on migrating to the IBM Cloud Satellite® Connector before the End of Support date. For more information, see the Secure Gateway deprecation dates and deprecation details.

To connect to an on-premises data source, you first need to download, install, and configure IBM® Secure Gateway for IBM Cloud®.

After you install the client for one on-premises data source, you can reuse it for other data sources in the project.

The number of gateways that you can create is limited to 50.

For more information, see About Secure Gateway.

You can use the IBM Secure Gateway with the following connectors only:

To install IBM® Secure Gateway for IBM Cloud®, complete the following steps:

  1. From the data source configuration page, click Manage connection.

  2. On the Download and install Secure Gateway client page, download the appropriate version of IBM® Secure Gateway for IBM Cloud®.

  3. After you complete the download, click Download Secure Gateway and Continue.

  4. When prompted, enter the Gateway ID and Token that are displayed.

    For more information, see Installing the client.

  5. On the machine where the Secure Gateway Client is running, open the Secure Gateway dashboard at http://localhost:9003.

  6. Click add ACL on the dashboard, and add the URL of the data source that you want to access to the Allow access list.

    For example, hostname: mycompany.sharepoint.com or mycompanywebsite.com and port: 80.

  7. Return to Discovery, and click Continue.

    • If the connection is successful, a Connection successful message is displayed.
    • If the connection is unsuccessful, open the IBM® Secure Gateway for IBM Cloud® dashboard, and verify that the endpoints on the Allow access list are correct.
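When a connection fails, it can help to confirm which hostname and port a data source URL actually resolves to before you compare it with the Allow access list. The following Python sketch does only that comparison. It does not call Secure Gateway and does not reproduce its matching rules; the `endpoint_of` and `is_allowed` helpers, and the representation of the list as hostname and port pairs, are assumptions for illustration.

```python
from urllib.parse import urlsplit

# Default ports that are assumed when the URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def endpoint_of(url: str) -> tuple[str, int]:
    """Extract the (hostname, port) pair that a URL points to."""
    parts = urlsplit(url)
    if parts.hostname is None:
        raise ValueError(f"No hostname found in URL: {url!r}")
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if port is None:
        raise ValueError(f"Cannot determine the port for URL: {url!r}")
    return parts.hostname, port


def is_allowed(url: str, allow_list: set[tuple[str, int]]) -> bool:
    """Check whether the URL's endpoint appears in a local copy of the list."""
    return endpoint_of(url) in allow_list


# Local copy of the example entries from step 6.
allow_list = {("mycompany.sharepoint.com", 80), ("mycompanywebsite.com", 80)}
```

With these entries, an `https://` URL for the same host is reported as not allowed because it resolves to port 443, which is a common cause of a mismatched entry.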

Data source connection and data isolation

When you connect to external data sources, you reduce the data isolation of your service instance because data in transit between the source and the service cannot be isolated. All other data isolation (at rest, administration, and query) remains fully in place.

All in-flight communication among services and data sources is encrypted with TLS v1.2. The private keys for the TLS certificates are encrypted at rest with AES-256-GCM encryption. The service certificates expire every three years, and the certificate revocation lists are updated monthly. All credentials are sent over an encrypted connection that uses TLS v1.2 and are encrypted at rest with AES-256 encryption. Connections to data sources use the secure protocols that are supported by the data sources.
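If you write your own scripts to test an endpoint that Discovery will crawl, you might want the test client to refuse anything older than TLS v1.2. The following Python sketch shows one way to do that with the standard `ssl` module. It configures only a local client and says nothing about how Discovery itself is configured; the `make_tls12_context` name is hypothetical.

```python
import ssl


def make_tls12_context() -> ssl.SSLContext:
    """Build a client-side TLS context that rejects versions below TLS 1.2."""
    # The default context verifies certificates and checks host names.
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

You can pass the returned context to standard library clients that accept one, for example `urllib.request.urlopen(url, context=make_tls12_context())`.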

Viewing collections that are connected to a gateway

You can view a list of collections that are connected to a particular gateway. Complete the following steps to view collections that share a particular gateway:

  1. From the My projects page, click Data usage and GDPR.

  2. Click On premises.

    Collections that share a common gateway are displayed in the Connected collections list.

Connecting to data sources with IP restrictions

Some data sources allow crawlers from only a limited number of trusted network addresses or domains to access and process their data. If one of the data sources that you want to connect to limits access in this way, you can add IBM-managed IP addresses to the allowlist of the data source.

Network addresses are subject to change from time to time. You can monitor for updates to these addresses by subscribing to the repo notifications for this page. Click Edit Topic and then select Watching in the Notifications dialog of the repo.

  • For service instances that are hosted in a US-based data center and that were created on or after 1 May 2020, add the following IP addresses:

    150.238.21.0/28
    169.48.255.224/28
    174.36.69.128/28
    
  • For service instances that are hosted in non-US data centers and that were created on or after 21 February 2021, add the following IP addresses:

    159.122.203.64/28
    158.175.114.128/28
    158.176.107.48/28
    
  • For a list of IP addresses that you can add to an allowlist for service instances that were created before 1 May 2020 (US) or before 21 February 2021 (non-US), see the network addresses that are listed for Cloud Foundry.

    • Refer to the Dallas data center IP addresses for all US-hosted service instances.
    • Refer to the London data center IP addresses for all service instances that are hosted outside the US.
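Each entry in the lists above is a CIDR block, and a /28 block covers 16 consecutive addresses. To check whether an address that appears in your data source's access logs falls inside one of these blocks, you can use a short Python sketch such as the following. The ranges are copied from this page and are subject to change; the `in_ranges` helper and the constant names are hypothetical.

```python
import ipaddress

# Ranges copied from the lists above. Verify them against the current page.
US_RANGES = ["150.238.21.0/28", "169.48.255.224/28", "174.36.69.128/28"]
NON_US_RANGES = ["159.122.203.64/28", "158.175.114.128/28", "158.176.107.48/28"]


def in_ranges(address: str, cidrs: list[str]) -> bool:
    """Return True if the address belongs to any of the CIDR blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
```

For example, `in_ranges("150.238.21.5", US_RANGES)` is true, while `in_ranges("150.238.21.16", US_RANGES)` is false because the first block ends at 150.238.21.15.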