Configuring connector and seed options
The Data Crawler is no longer supported or available for download beginning 17 April 2019. This content is provided for existing installations only. See Connecting to Data Sources for other available connectivity options.
When crawling data, the Crawler first identifies the type of data repository (connector) and the user-specified starting location (seed) to begin downloading information.
When using the Data Crawler, data repository security settings are ignored.
Seeds are the starting points of a crawl, and are used by the Data Crawler to retrieve data from the resource that is identified by the connector. Typically, seeds configure URLs to access protocol-based resources such as fileshares, SMB shares, databases, and other data repositories that are accessible by various protocols. Moreover, different seed URLs have different capabilities. Seeds can also be repository-specific, to enable crawling of specific third-party applications such as customer relationship management (CRM) systems, product life cycle (PLC) systems, content management systems (CMS), cloud-based applications, and web database applications.
To crawl your data correctly, you must ensure that the Crawler is properly configured to read your data repository. The Data Crawler provides connectors to support data collection from the following repositories:
- Filesystem
- Databases, via JDBC
- CMIS (Content Management Interoperability Services)
- SMB (Server Message Block), CIFS (Common Internet Filesystem), or Samba fileshares
A connector configuration template is also provided, which allows you to customize a connector.
To configure your connector do the following:
-
Open the file in the
connectors
directory that corresponds to the repository you are connecting to (for examplefilesystem.conf
is the configuration file for the filesystem connector) in a text editor. -
Modify the values that are appropriate to your repository:
- Save and close the file.
- Repeat for the
-seed.conf
file in theconnectors/seeds
directory that corresponds to the repository you are connecting to (for examplefilesystem-seed.conf
is the seed (where to connect to) configuration file for the filesystem connector) in a text editor. - Proceed to Configuring the Data Crawler.
To access the in-product manual for the connector and seed configuration files, with the most up-to-date information, type the following commands from the Crawler installation directory:
- For connector configuration options:
man crawler-options.conf
- For crawl seed configuration options:
man crawler-seed.conf
Configuring filesystem crawl options
The filesystem connector allows you to crawl files local to the Data Crawler installation.
Another option to upload large numbers of files into Discovery is discovery-files on GitHub.
Configuring the filesystem connector
Following are the basic configuration options that are required to use the filesystem connector. To set these values, open the file config/connectors/filesystem.conf
, and modify the following values specific to your use cases:
protocol
- The name of the connector protocol used for the crawl. Usesdk-fs
for this connector.collection
- This attribute is used to unpack temporary files. The default value iscrawler-fs
logging-config
- Specifies the file used for configuring logging options; it must be formatted as alog4j
XML string.classname
- Java class name for the connector. The value to use this connector must beplugin:filesystem.plugin@filesystem
.
Configuring the filesystem crawl seed
The following values can be configured for the filesystem crawl seed file. To set these values, open the file config/seeds/filesystem-seed.conf
and specify the following values specific to your use cases:
-
url
- Newline-separated list of files and folders to crawl. UNIX users can use a path such as/usr/local/
.The URLs must start with
sdk-fs://
. So to crawl, for example,/home/watson/mydocs
, the value of this URL issdk-fs:///home/watson/mydocs
. The third/
in the URL is necessary.The filesystems used by Linux, UNIX, and UNIX-like computer systems can contain special types of files, such as block and character device nodes and files that represent named pipes, which cannot be crawled because they do not contain data, but serve as device or I/O access points. Attempting to crawl such files generate errors during the crawl. To avoid such errors, exclude the
/dev
directory in any top-level crawl on a Linux, UNIX, or UNIX-like filesystem. If present on the system that you are crawling, also exclude temporary system directories, such as/proc
,/sys
, and/tmp
, that contain transient files and system information.hops
- Internal use only.default-allow
- Internal use only.
Configuring Database crawl options
The database connector allows you to crawl a database by executing a custom SQL command and creating one document per row (record) and one content element per column (field). You can specify a column to be used as a unique key, as well as a column containing a timestamp representing the last-modification date of each record. The connector retrieves all records from the specified database, and can also be restricted to specific tables, joins, and so on in the SQL statement.
The Database connector allows you to crawl the following databases:
- IBM® DB2
- MySQL
- Oracle PostgreSQL
- Microsoft SQL Server
- Sybase
- Other SQL-compliant databases, via a JDBC 3.0-compliant driver
The connector retrieves all records from the specified database and table.
JDBC Drivers - The Database connector ships with Oracle JDBC (Java Database Connectivity) driver version 1.5. All third-party JDBC drivers shipped with the Data Crawler are located in the connectorFramework/crawler-connector-framework-#.#.#/lib/java/database
directory of your Data Crawler installation, where you can add, remove, and modify them as necessary. You can also use the extra_jars_dir
setting in the crawler.conf
file to specify another location.
DB2 JDBC Drivers - The Data Crawler does not ship with the JDBC drivers for DB2 due to licensing issues. However, all DB2 installations in which you installed JDBC support include the JAR files that the Data Crawler
requires, to crawl a DB2 installation. To crawl a DB2 instance, you must copy these files into the appropriate directory in your Data Crawler installation so that the Database connector can use them. To enable the Data Crawler to crawl a DB2
installation, locate the db2jcc.jar
and license (typically, db2jcc_license_cu.jar
) JAR files in your DB2 installation, and copy those files to the connectorFramework/crawler-connector-framework-#.#.#/lib/java/database
subdirectory of your Data Crawler installation directory, or you can use the extra_jars_dir
setting in the crawler.conf
file to specify another location.
MySQL JDBC Drivers - The Data Crawler does not ship with the JDBC drivers for MySQL because of possible license issues if they were delivered as part of the product. However, downloading the JAR file that contains the MySQL JDBC drivers and integrating that JAR file into your Data Crawler installation is quite easy to do:
- Use a web browser to visit the MySQL download site, and locate the source and binary download link for the archive format that you want to use (typically zip for Microsoft Windows systems or a gzipped tarball for Linux systems). Click that link to initiate the download process. Registration might be required.
- Use the appropriate unzip archive-file-name or tar zxf archive-file-name command to extract the contents of that archive, based on the type and name of the archive file that you download.
- Change to the directory that was extracted from the archive file, and copy the JAR file from this directory to the
connectorFramework/crawler-connector-framework-#.#.#/lib/java/database
subdirectory of your Data Crawler installation directory, or you can use theextra_jars_dir
setting in thecrawler.conf
file to specify another location.
Configuring the Database Connector
Following are the basic configuration options that are required to use the Database connector. To set these values, open the file config/connectors/database.conf and modify the following values specific to your use cases:
protocol
- The name of the connector protocol used for the crawl. The value for this connector is based on the database system to be accessed.collection
- This attribute is used to unpack temporary files.classname
- Java class name for the connector. The value to use this connector must beplugin:database.plugin@database
.logging-config
- Specifies the file used for configuring logging options; it must be formatted as alog4j
XML string.
Configuring the Database Crawl Seed
The following values can be configured for the Database crawl seed file. To set these values, open the file config/seeds/database-seed.conf
and specify the following values specific to your use cases:
-
url
- The URL of the table or view to retrieve. Defines your custom SQL database seed URL. The structure is:database-system://host:port/database?[per=num]&[sql=SQL]
Testing a seed URL shows all of the enqueued URLs. For example, testing the following URL for a database containing 200 records:
sqlserver://test.mycompany.com:1433/WWII_Navy/?per=100&sql=select%20*%20from%20vessel&
shows the following enqueued URLs:
sqlserver://test.mycompany.com:1433/WWII_Navy/?key-val=0&
sqlserver://test.mycompany.com:1433/WWII_Navy/?key-val=100&
sqlserver://test.mycompany.com:1433/WWII_Navy/?key-val=200&
While testing the following URL shows the data retrieved from row 43:
sqlserver://test.mycompany.com:1433/WWII_Navy/?per-1&key-val=43
hops
- Internal use only.default-allow
- Internal use only.user-password
- Credentials for the database system. The username and password need to be separated by a:
, and the password must be encrypted using the vcrypt program shipped with the Data Crawler. For example:username:[[vcrypt/3]]passwordstring
max-data-size
- Maximum size of the data for a document. This is the largest block of memory that is loaded at one time. Only increase this limit if you have sufficient memory on your computer.filter-exact-duplicates
- Internal use only.timeout
- Internal use only.jdbc-class
(Extender option) - When specified, this string overrides the JDBC class used by the connector when(other)
is chosen as the database system.connection-string
(Extender option) - When specified, this string overrides the automatically generated JDBC connection string. This allows you to provide more detailed configuration about the database connection, such as load-balancing or SSL connections. For example:jdbc:netezza://127.0.0.1:5480/databasename
save-frequency-for-resume
(Extender option) - Specifies the name of a column or associated label, to resume a crawl or do a partial refresh. The seed saves the name of this column at regular intervals as it proceeds with the crawl and saves it again, when the last row of your database is crawled. When resuming the crawl or refreshing it, the crawl begins with the row that is identified in the saved value for this field
Configuring CMIS crawl options
The CMIS (Content Management Interoperability Services) connector lets you crawl CMIS-enabled CMS (Content Management System) repositories, such as Alfresco, Documentum or IBM® Content Manager, and index the data that they contain.
Configuring the CMIS Connector
Following are the basic configuration options that are required to use the CMIS connector. To set these values, open the file config/connectors/cmis.conf
and specify the following values specific to your use cases:
protocol
- The name of the connector protocol used for the crawl. The value to use this connector must becmis
.collection
- This attribute is used to unpack temporary files.dns
- Unused option.classname
- Java class name for the connector. Useplugin:cmis-v1.1.plugin@connector
for this connector.logging-config
- Specifies the file used for configuring logging options; it must be formatted as alog4j
XML string.endpoint
- The service endpoint URL of a CMIS-compliant repository.username
- The user name of the CMIS repository user used to access the content. This user must have access to all the target folders and documents to be crawled and indexed.password
- Password of the CMIS repository used to access the content. Password must not be encrypted; it must be given in plain text.repositoryid
- The ID of the CMIS repository used to access the content for that specific repository.bindingtype
- Identifies what type of binding is to be used to connect to a CMIS repository. Value is either AtomPub or WebServices.authentication
- Identifies what type of authentication mechanism to use while contacting a CMIS-compatible repository: Basic HTTP Authentication, NTLM, or WS-Security(Username token).enable-acl
- Enables retrieving ACLs for crawled data. If you are not concerned about security for the documents in this collection, disabling this option increases performance by not requesting this information with the document and not retrieving and encoding this information. Value is either true or false.user-agent
- A header sent to the server when crawling documents.method
- The method (GET or POST) by which parameters are passed.-
url-logging
- The extent to which crawled URLs are logged. Possible values are:full-logging
- Log all information about the URL.refined-logging
- Only log the information necessary to browse the crawler log and for the connector to function correctly; this is the default value.minimal-logging
- Only log the minimum amount of information necessary for the connector to function correctly.
Setting this option to
minimal-logging
reduces the size of the logs and gain a slight performance increase due to the smaller I/O associated with minimizing the amount of data that is being logged. ssl-version
- Specifies a version of SSL to use for HTTPS connections. By default the strongest protocol available is used.
Configuring the CMIS Crawl Seed
The following values can be configured for the CMIS crawl seed file. To set these values, open the file config/seeds/cmis-seed.conf
and modify the following values specific to your use cases:
-
url
- The URL of a folder from the CMIS repository to be used as a starting point of the crawl, for example:cmis://alfresco.test.com:8080/alfresco/cmisatom?folderToProcess-workspace://SpacesStore/guid
To crawl from the root folder, you need to give the URL as:
cmis://alfresco.test.com:8080/alfresco/cmisatom?folderToProcess-
at
- Unused option.default-allow
- Internal use only.
Configuring SMB/CIFS/Samba crawl options
The Samba connector allows you to crawl Server Message Block (SMB) and Common Internet filesystem (CIFS) fileshares. This type of fileshare is common on Windows networks, and is also provided through the open source project Samba.
Configuring the Samba Connector
Following are the basic configuration options that are required to use the Samba connector. To set these values, open the file config/connectors/samba.conf
and specify the following values specific to your use cases:
protocol
- The name of the connector protocol used for the crawl. The value to use this connector is smb.collection
- This attribute is used to unpack temporary files.classname
- Java class name for the connector. The value to use this connector must beplugin:smb.plugin@connector
.logging-config
- Specifies the file used for configuring logging options; it must be formatted as alog4j
XML string.username
- The Samba username to authenticate with. If provided, domain and password must also be provided. If not provided, the guest account is used.password
- The Samba password to authenticate with. If the username is provided, this is required. Password must be encrypted using the vcrypt program shipped with the Data Crawler.archive
- Enables the Samba connector to crawl and index files that are compressed within archive files. Value is either true or false; default value is false.max-policies-per-handle
- Specifies the maximum number of Local Security Authority (LSA) policies that can be opened for a single RPC handle. These policies define the access permissions that are required to query or modify a particular system under various conditions. The default value for this option is 255.crawl-fs-metadata
- Turning on this option causes the Data Crawler to add a VXML document containing the available filesystem metadata about the file (creation date, last modified date, file attributes, etc.).enable-arc-connector
- Unused option.-
disable-indexes
- Newline-separated list of indexes to disable, which might result in a faster crawl, for example:disable-url-index
disable-error-state-index
disable-crawl-time-index
exact-duplicates-hash-size
- Sets the size of the hash table used for resolving exact duplicates. Be very careful when modifying this number. The value that you select must be prime, and larger sizes can provide faster lookups but require more memory, while smaller sizes can slow down crawls but substantially reduce memory usage.user-agent
- Unused option.timeout
- Unused optionn-concurrent-requests
- The number of requests that are sent in parallel to a single IP address. The default is1
.enqueue-persistence
- Unused option.
Configuring the Samba Crawl Seed
The following values can be configured for the Samba crawl seed file. To set these values, open the file config/seeds/samba-seed.conf
and specify the following values specific to your use cases:
-
url
- A newline-separated list of shares to crawl, for example:smb://share.test.com/office smb://share.test.com/cash/money/change smb://share.test.com/C$/Program Files
-
hops
- Internal use only. default-allow
- Internal use only.