IBM Cloud Docs
Salesforce

Crawl documents that are stored in Salesforce.

IBM Cloud Pak for Data only

This information applies only to installed deployments. For more information about connecting to Salesforce from a managed deployment, see Salesforce.

What documents are crawled

  • Knowledge Articles are crawled only if their version is published and their language is en-us.
  • Only documents that are supported by Discovery are crawled; all others are ignored. For more information, see Supported file types.
  • When a source is recrawled, new documents are added, updated documents are replaced with their current versions, and deleted documents are removed from the collection's index.
  • All Discovery data source connectors are read-only. Regardless of the permissions that are granted to the crawl account, Discovery never writes, updates, or deletes any content in the original data source.

Discovery can crawl the following objects:

  • Any default and custom objects that you have access to
  • Accounts
  • Contacts
  • Cases
  • Contracts
  • Knowledge articles
  • Attachments

Data source requirements

In addition to the data source requirements for all installed deployments, your Salesforce data source must meet the following requirements:

  • The instance that you plan to connect to must be part of an Enterprise plan or higher.
  • You must obtain any required service licenses for the data source that you want to connect to. For more information about licenses, contact the system administrator of the data source.

For more information about Salesforce, see the Salesforce developer documentation.

Prerequisite step

To crawl documents in Salesforce, Discovery uses a Web Services Description Language (WSDL) file. The WSDL file defines the web service interface from which the API that manages access is generated.

If you plan to crawl documents from both a Sandbox and a Production instance of Salesforce, you must establish a separate connection to each web service. Download the JAR files from each web service, and set up a separate collection for each instance.

To prepare the WSDL JAR files, complete the following steps:

  1. Download the following JAR files:

    • force-partner.jar (from partner WSDL)
    • force-metadata.jar (from metadata WSDL)
    • force-wsc.jar (from Force.com Web Service Connector (WSC))
    • commons-beanutils.jar (from Apache Commons BeanUtils)
  2. Compress the JAR files into a compressed file. You will upload the compressed file to Discovery in the next procedure.
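The compression step can be sketched with Python's standard zipfile module. The archive name salesforce-jars.zip is arbitrary; the JAR names are the ones listed in step 1:

```python
# Minimal sketch: package the downloaded JAR files into one ZIP archive for
# upload to Discovery. JARs that are missing from the current directory are
# skipped with a warning so that you can spot an incomplete download.
from pathlib import Path
import zipfile

JARS = [
    "force-partner.jar",
    "force-metadata.jar",
    "force-wsc.jar",
    "commons-beanutils.jar",
]

with zipfile.ZipFile("salesforce-jars.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for jar in JARS:
        if Path(jar).exists():
            zf.write(jar)
        else:
            print(f"warning: {jar} not found, skipping")
```

Any archive tool (for example, the zip command) produces an equivalent result; the only requirement is that the four JAR files end up in one compressed file.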

Connecting to a Salesforce data source

From your Discovery project, complete the following steps:

  1. From the navigation pane, choose Manage collections.

  2. Click New collection.

  3. Click Salesforce, and then click Next.

  4. Name the collection.

  5. If the language of the documents in Salesforce is not English, select the appropriate language.

    For a list of supported languages, see Language support.

  6. Optional: Change the synchronization schedule.

    For more information, see Crawl schedule options.

  7. In the Specify what you want to crawl section, enter values in the following fields:

    Username
    The username to call the Salesforce API.
    Password
    The password of the specified user.
    Security Token
    The security token of the specified user, which is required to call the Salesforce API.
    Jar zip archive file
    Upload a compressed file that contains the JAR files that you downloaded earlier. Or select a compressed file that you uploaded previously to reuse it.
  8. Optional: Expand the Proxy settings section to add information that is required if you are using a proxy server to access the data source server.

    • Enable proxy settings: Set the switch to On, and then add the following information:

      Username
      The proxy server username to use to authenticate, if the proxy server requires authentication. If you do not know your username, you can get it from the administrator of your proxy server.
      Password
      The proxy server password to use to authenticate, if the proxy server requires authentication. If you do not know your password, you can get it from the administrator of your proxy server.
      Proxy server host name or IP address
      The hostname or the IP address of the proxy server.
      Proxy server port number
      The network port that you want to connect to on the proxy server.
  9. In the Object Types section, specify the object types to crawl.

    The default behavior is to crawl all object types.

    • For custom object names, append __c to match the Salesforce API convention for custom object names. For example, to crawl MyCustomObject, specify MyCustomObject__c.
    • Do not specify a comment object (such as FeedComment, CaseComment, or IdeaComment) without also specifying its corresponding root object (FeedItem, Case, or Idea, respectively).
    • If you specify a tag object, you must also specify its parent. For example, do not specify the AccountTag object without also specifying the Account object.
  10. If you want the crawler to extract text from images on the site, expand More processing settings, and set Apply optical character recognition (OCR) to On.

    When OCR is enabled and your documents contain images, processing takes longer. For more information, see Optical character recognition.

  11. Click Finish.
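The custom-object naming convention from step 9 can be sketched as a small helper. The to_api_name function is hypothetical, not part of any Discovery or Salesforce library:

```python
# Hypothetical helper illustrating the Salesforce API naming convention:
# standard objects keep their names; custom objects carry a __c suffix.
STANDARD_OBJECTS = {"Account", "Contact", "Case", "Contract"}

def to_api_name(object_name: str) -> str:
    """Return the Salesforce API name for an object type."""
    if object_name in STANDARD_OBJECTS or object_name.endswith("__c"):
        return object_name
    return object_name + "__c"

print(to_api_name("Account"))         # Account
print(to_api_name("MyCustomObject"))  # MyCustomObject__c
```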

The collection is created quickly. It takes more time for the data to be processed as it is added to the collection.

If you want to check the progress, go to the Activity page. From the navigation pane, click Manage collections, and then click to open the collection.
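As background for the proxy fields in step 8: HTTP clients conventionally combine the username, password, host name, and port into a single proxy URL. The helper below illustrates that convention only; it is not part of Discovery:

```python
# Illustration: the four proxy fields from step 8 expressed as the standard
# http://user:password@host:port proxy URL form that most HTTP clients accept.
# Credentials are percent-encoded so that special characters survive the URL.
from urllib.parse import quote

def proxy_url(host: str, port: int, username: str = "", password: str = "") -> str:
    auth = f"{quote(username, safe='')}:{quote(password, safe='')}@" if username else ""
    return f"http://{auth}{host}:{port}"

print(proxy_url("proxy.example.com", 3128, "crawler", "s3cret"))
# http://crawler:s3cret@proxy.example.com:3128
```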