IBM Cloud Docs
Box

Box

Crawl documents that are stored in a Box data source.

IBM Cloud IBM Cloud only

This information applies only to managed deployments. For more information about connecting to Box from an installed deployment, see Box.

What documents are crawled

During the initial crawl of the content, documents from all of the folders that can be accessed from your Box application are crawled and added to your collection. Box notes are stored in JSON format, so Discovery also ingests any Box notes in the specified folders.

The following table illustrates the objects that Discovery can crawl.

Table 1. Data sources crawling support
Data source Supports scheduled document refreshes? Objects that are crawled
Box (App access) No Files, folders that you share explicitly
Box (Enterprise access) Yes (New and modified documents only) Files, folders

When you configure Box with App access only, you must create App Users and share the files that you want to crawl with these users. You cannot crawl Box files that are shared only by the Service Account.

For more information about access, see these Box documentation help topics:

Documents that are deleted from Box are not deleted from the collection.

All Discovery data source connectors are read-only. Regardless of the permissions that are granted to the crawl account, Discovery never writes, updates, or deletes any content in the original data source.

Data source requirements

In addition to the data source requirements for all managed deployments, your Box data source must meet the following requirement:

You must obtain any required service licenses for the data source that you want to connect to. For more information about licenses, contact the system administrator of the data source.

Prerequisite step

You must create a custom application in Box before you can connect to Box from Discovery.

  1. In Box, create a custom app that uses Server Authentication with JWT as its authentication method.

    For detailed steps, see Setup with JWT in the Box Developer Documentation.

Follow these guidelines when you create the app:

  • During the setup procedure, choose to use the Server Authentication with JWT method to verify application identity with a key pair.

  • When you configure the custom app, you can choose to use one of the application access levels:

    • App access only
    • App access plus Enterprise access

    Refreshing documents on a schedule is supported only when you choose App access plus Enterprise access.

    If you set up the connection with App access, you must create App Users and share the files that you want to crawl with the App Users you define. With this configuration, new and modified documents are not crawled during a refresh.

    • If you are an administrator, configure App access plus Enterprise access. Otherwise, you can configure the app to have App access. However, you must get application approval from a Box administrator.

    • For both application access levels, specify the following settings:

    • Choose the following scopes:

      • Read all folders stored in Box
      • Write all folders stored in Box
      • Manage Users

      For apps with Enterprise access only: Add this extra scope:

      • Manage Enterprise Properties
    • Enable the following advanced features:

      • Make API calls using the as-user header
      • Generate User Access Tokens
  • Get the custom app authorized by an administrator.

    For more information, see App approval in the Box Developer Documentation.

  • After the app is created, authorized, and authentication is configured, download the app settings as a JSON file from the dev console.

    You provide the following information from this file when it is requested later:

    • client_id
    • enterprise_id
    • client_secret
    • public_key_id
    • private_key
    • passphrase

Connecting to the Box data source

From your Discovery project, complete the following steps:

  1. From the navigation pane, choose Manage collections.

  2. Click New collection.

  3. Click the link next to the Need to connect to a data source? field, click Box, and then click Next.

  4. Refer to the values from the Box app settings JSON file that you downloaded during the previous procedure to complete the following fields:

    Client ID
    The private key that you specify when you configure your Box app.
    Client Secret
    The client secret that you specify when you configure your Box app.
    Enterprise ID
    The enterprise ID of the Box account.
    Public Key ID
    The public key ID that Box generates.
    Private Key
    A part of the key pair that is generated to interact with the Box website.
    Passphrase
    The passphrase that is required to decrypt the private key if the private key is an encrypted file.
  5. Click Next.

  6. Name the collection.

  7. If the language of the documents in Box is not English, select the appropriate language.

    For a list of supported languages, see Language support.

  8. Optional: Change the synchronization schedule.

    For more information, see Crawl schedule options.

  9. Choose the folders that you want to crawl.

  10. If you want to limit the types of files to add to the collection, you can list the file extensions for file types to either include or exclude.

    When you choose to list extensions for file types to exclude, you must add at least one file extension.

    For a list of supported file types, see Supported file types.

  11. If you want the web crawl to extract text from images on the site, expand More processing settings, and set Apply optical character recognition (OCR) to On.

    When OCR is enabled and your documents contain images, processing takes longer. For more information, see Optical character recognition.

  12. Click Finish.

The collection is created quickly. It takes more time for the data to be processed as it is added to the collection.

If you want to check the progress, go to the Activity page. From the navigation pane, click Manage collections, and then click to open the collection.

Currently, not all documents are refreshed during scheduled recrawls. For more information, see the release note.