Microsoft SharePoint On Prem

Crawl documents that are stored in a Microsoft SharePoint data source that is hosted on premises.

IBM Cloud only

This information applies only to managed deployments. For more information about connecting to an on-premises SharePoint data source from an installed deployment, see SharePoint On Prem.

What documents are crawled

During the initial crawl of the content, documents from all of the objects that can be accessed from the site collection path that you specify are crawled and added to your collection. Custom metadata that is associated with the SharePoint content is also crawled. You can crawl one site collection path per collection. You cannot crawl Personal SiteCollections.

During subsequent scheduled recrawls, only new and modified documents are crawled and any changes are reflected in your collection. Documents that are deleted from the external data source are not deleted from the collection.

All Discovery data source connectors are read-only. Regardless of the permissions that are granted to the crawl account, Discovery never writes, updates, or deletes any content in the original data source.
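
Conceptually, each scheduled recrawl behaves like the following sketch: documents that changed since the previous crawl are added or updated in the collection, nothing is ever written back to SharePoint, and deletions in the source are not propagated. The sketch is illustrative only; the function and field names are hypothetical and do not reflect Discovery's internal implementation.

    from datetime import datetime, timezone

    def recrawl(source_documents, collection, last_crawl_time):
        """Illustrative model of a scheduled recrawl (not Discovery internals).

        source_documents: dicts read from SharePoint, for example
            {"id": "...", "modified": datetime, "body": "..."}
        collection: dict mapping document id -> document in the collection
        last_crawl_time: datetime of the previous crawl
        """
        for doc in source_documents:            # read-only access to the source
            if doc["modified"] > last_crawl_time:
                collection[doc["id"]] = doc     # new and changed documents are upserted
        # Documents that were deleted from SharePoint are intentionally left in the collection.
        return datetime.now(timezone.utc)       # becomes last_crawl_time for the next run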

The following table illustrates the objects that Discovery can crawl.

Table 1. Data sources crawling support
Data source | Objects that are crawled
Microsoft SharePoint On Prem | SiteCollections, Sites, SubSites, Lists, List Items, Document Libraries, List Item Attachments

Data source requirements

In addition to the data source requirements for all managed deployments, your SharePoint On Prem data source must meet the following requirements:

  • You can connect to a SharePoint 2013, 2016, or 2019 on-premises data source.
  • The user ID must have SiteCollection Administrator permission and must be able to access all of the sites and lists that you want to crawl.
  • The crawler supports Windows New Technology LAN Manager (NTLM) v1 authentication only. It does not support NTLM v2 or Security Assertion Markup Language (SAML) authentication.
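
Before you configure the connection in Discovery, you can optionally confirm that the crawl account can authenticate to the web application. The following sketch assumes the third-party Python packages requests and requests-ntlm are installed; it is not part of Discovery, and the library may negotiate NTLM v2 by default, so treat it only as a basic credentials and connectivity check.

    import requests
    from requests_ntlm import HttpNtlmAuth

    # Hypothetical values; replace with your own domain, account, and web application URL.
    domain = "sharepoint.mycointernal"
    username = "siteadmin01"
    password = "********"
    web_app_url = "https://sharepointwebapp.com:8443"

    response = requests.get(
        web_app_url,
        auth=HttpNtlmAuth(f"{domain}\\{username}", password),
        timeout=30,
    )
    print(response.status_code)  # 200 indicates the account can authenticate and reach the site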

What you need before you begin

Have the following information ready. If you don't know these values, ask your SharePoint administrator to provide them or consult the Microsoft SharePoint developer documentation:

Username
The username to use to connect to the SharePoint On Prem web application that you want to crawl. For example, siteadmin01.
Password
The password to connect to the SharePoint On Prem web application that you want to crawl. This value is never returned and is only used when credentials are created or modified.
Web Application URL
The SharePoint web application URL. For example, https://sharepointwebapp.com:8443. If you do not enter a port number, the default value of 80 is used for an HTTP URL and 443 for HTTPS.
Domain
The domain name of the SharePoint On Prem account. For example, sharepoint.mycointernal.
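
For reference, these values map one-to-one onto the connection form. The following sketch shows one way to gather them and to apply the documented default-port rule (80 for HTTP, 443 for HTTPS when no port is given). The variable names are illustrative and the snippet is not a Discovery API call.

    from urllib.parse import urlparse

    # Example values; substitute your own.
    connection = {
        "username": "siteadmin01",
        "password": "********",        # never returned after the credentials are stored
        "web_application_url": "https://sharepointwebapp.com:8443",
        "domain": "sharepoint.mycointernal",
    }

    # Default-port rule described above: 80 for HTTP, 443 for HTTPS.
    parsed = urlparse(connection["web_application_url"])
    port = parsed.port or (80 if parsed.scheme == "http" else 443)
    print(f"Crawling {parsed.hostname} on port {port}")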

Prerequisite step

IBM® Secure Gateway for IBM Cloud® is being deprecated. After the End of Support date for Secure Gateway, you cannot use Discovery to crawl documents stored in a Microsoft SharePoint data source that is hosted on premises. For more information, see the Secure Gateway deprecation dates and details.

Before you can connect to a SharePoint On Prem data source, you must install and configure IBM® Secure Gateway for IBM Cloud®.

For more information, see Installing IBM Secure Gateway for on-premises data.

Connecting to the data source

To configure the Microsoft SharePoint On Prem data source, complete the following steps in Discovery:

  1. From the navigation pane, choose Manage collections.

  2. Click New collection.

  3. Click the link next to the Need to connect to a data source? field, click SharePoint On Prem, and then click Next.

  4. Add values to the following fields:

    • Username
    • Password
    • Web Application URL
    • Domain

    Click Next.

  5. Name the collection.

  6. If the language of the documents on the site is not English, select the appropriate language.

    For a list of supported languages, see Language support.

  7. Optional: Change the synchronization schedule.

    For more information, see Crawl schedule options.

  8. If you want to limit the types of files to add to the collection, you can list the file extensions for file types to either include or exclude (see the extension filter sketch after these steps).

    When you choose to list extensions for file types to exclude, you must add at least one file extension.

    For a list of supported file types, see Supported file types.

  9. If you want the crawler to extract text from images on the site, expand More processing settings, and set Apply optical character recognition (OCR) to On.

    When OCR is enabled and your documents contain images, processing takes longer. For more information, see Optical character recognition.

  10. Click Finish.
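
The file type filter in step 8 behaves like a simple extension check. The following sketch only illustrates the include/exclude logic; Discovery applies the filtering for you, and the file names and extension lists are hypothetical.

    # Hypothetical extension lists, matching the include/exclude choice in step 8.
    include_extensions = {"pdf", "docx"}   # crawl only these types, or ...
    exclude_extensions = {"png", "gif"}    # ... skip these types (at least one required)

    def is_included(filename, include=None, exclude=None):
        """Return True if a file would be added to the collection."""
        extension = filename.rsplit(".", 1)[-1].lower()
        if include:
            return extension in include
        if exclude:
            return extension not in exclude
        return True

    print(is_included("report.pdf", include=include_extensions))   # True
    print(is_included("banner.png", exclude=exclude_extensions))   # False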

The collection is created quickly. It takes more time for the data to be processed as it is added to the collection.

If you want to check the progress, go to the Activity page. From the navigation pane, click Manage collections, and then click the collection to open it.