Configuring the Data Crawler
The Data Crawler is no longer supported or available for download beginning 17 April 2019. This content is provided for existing installations only. See Connecting to Data Sources for other available connectivity options.
To set up the Data Crawler to crawl your repository, you must specify the appropriate input adapter in the crawler.conf file, and then configure repository-specific information in the input adapter configuration files.
Before making the changes listed in these steps, make sure that you create your working directory by copying the contents of the {installation_directory}/share/examples/config directory to a working directory on your system, for example /home/config.
Do not modify the provided configuration example files directly. Copy them, and then edit the copies. If you edit the example files in place, your configuration might be overwritten when upgrading the Data Crawler, or it might be removed when uninstalling it.
References in this guide to files in the config directory, such as config/crawler.conf, refer to that file in your working directory, and NOT in the installed {installation_directory}/share/examples/config directory.
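The copy step described above can be sketched in Python. This is a minimal illustration, not part of the Data Crawler: the function name create_working_config is hypothetical, and the paths you pass in depend on where the crawler is installed on your system.

```python
import shutil
from pathlib import Path


def create_working_config(examples_dir: Path, working_dir: Path) -> Path:
    """Copy the example configuration tree into a separate working directory."""
    if working_dir.exists():
        # Refuse to overwrite a working configuration that already exists.
        raise FileExistsError(f"{working_dir} already exists")
    # copytree creates working_dir and copies every file and subdirectory.
    shutil.copytree(examples_dir, working_dir)
    return working_dir


# Hypothetical usage; replace <installation_directory> with your own path:
# create_working_config(
#     Path("<installation_directory>/share/examples/config"),
#     Path("/home/config"),
# )
```

Refusing to copy over an existing directory is a design choice in this sketch; it keeps an edited working configuration from being silently replaced.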
The values specified in the following steps are the defaults in config/crawler.conf, and they configure the Filesystem connector:
- Open the config/crawler.conf file in a text editor.
  - Set the crawl_config_file option to the .conf file that you previously modified, for example: connectors/filesystem.conf.
  - Set the crawl_seed_file option to the -seed.conf file that you previously modified, for example: seeds/filesystem-seed.conf.
  - Set the output_adapter class and config options for Discovery as follows:

    class = "com.ibm.watson.crawler.discoveryserviceoutputadapter.DiscoveryServiceOutputAdapter"
    config = "discovery_service"
    discovery_service {
      include "discovery/discovery_service.conf"
    }

  There are other optional settings in this file that you can set as appropriate to your environment. See Configuring crawl options, Configuring the input adapter, Configuring the output adapter, and Additional crawl management options for detailed information about setting these values.
- Open the discovery/discovery_service.conf file in a text editor. Modify the following values, which are specific to the Discovery instance that you previously created on IBM Cloud®:
  - environment_id - Your Discovery environment ID.
  - collection_id - Your Discovery collection ID.
  - configuration_id - Your Discovery configuration ID.
  - configuration - The full path location of this discovery_service.conf file, for example, /home/config/discovery/discovery_service.conf.
  - username - Username credential for your Discovery instance.
  - apikey - Credential for your Discovery instance.
  There are other optional settings in this file that you can set as appropriate to your environment. See Configuring service options for detailed information about setting these values.
- After modifying these files, you are ready to crawl your data. Proceed to Crawling your data repository to continue.
Configuring crawl options
The file config/crawler.conf contains information that tells the Data Crawler which files to use for its crawl (input adapter), where to send the collection of crawled files after the crawl finishes (output adapter), and other crawl management options.
All file paths are relative to the config directory, except where noted.
To access the in-product manual for the crawler.conf file, with the most up-to-date information, type the following command from the Crawler installation directory: man crawler.conf
The options that can be set in this file are:
Input adapter
- class - Internal use only; defines the Data Crawler input adapter class. The default value is: com.ibm.watson.crawler.connectorframeworkinputadapter.Crawl
- config - Internal use only; defines the connector framework configuration. The default configuration key within this block to pass to the chosen input adapter is: connector_framework
The connector framework is what allows you to talk to your data. It could be internal data within the enterprise, or it could be external data on the web or in the cloud. The connectors allow access to a number of different data sources, while connecting is actually controlled by the crawling process.
Data retrieved by the Connector Framework Input Adapter is cached locally and is not encrypted. By default, the data is cached to a temporary directory. You must clear this directory after a reboot, and it must be readable only by the user who ran the crawler command.
There is a chance that this directory can outlive the crawler if the connector framework is removed before it can clean up after itself. Consider the location for your cached data. You can put data on an encrypted filesystem, but that might have performance implications. Pick the appropriate balance between speed and security for your crawls.
- crawl_config_file - The configuration file to use for the crawl. Default value is: connectors/filesystem.conf
- crawl_seed_file - The crawl seed file to use for the crawl. Default value is: seeds/filesystem-seed.conf
- id_vcrypt_file - Keyfile used for data encryption by the Crawler; the default key included with the crawler is id_vcrypt. Use the vcrypt script in the bin folder if you need to generate a new id_vcrypt file.
- crawler_temp_dir - The Crawler temporary folder for connector logs. Default value, tmp, is provided. If it doesn't already exist, the tmp folder is created in the current working directory.
- extra_jars_dir - Adds a directory of extra JARs to the connector framework classpath. Relative to the connector framework lib/java directory.
  - This value must be database when using the Database connector. You can leave this value empty (i.e., empty string "") when using other connectors.
- urls_to_filter - Blocklist of URLs that must not be crawled, in regular expression form. The Data Crawler does not crawl URLs that match any of the regular expressions provided.
  The domain list contains the domains that cannot be crawled. Add to it if necessary.
  The filetype list contains the file extensions that the Orchestration Service does not support. Remove any supported filetypes from the regular expressions.
  Ensure that your seed URL domain is allowed by the filter. Use an empty filter for "allow everything" behavior.
  Ensure that your seed URL is not excluded by a filter, or the Crawler might hang.
- max_text_size - The maximum size, in bytes, that a document can be before it is written to disk by the Connector Framework. Increasing this value decreases the number of documents written to disk, but increases the memory requirement. Default value is 1048576
- extra_vm_params - Allows you to add extra Java parameters to the command used to launch the Connector Framework.
- bootstrap_logging - Writes connector framework startup log; useful for advanced debugging only. Possible values are true or false. Log file is written to crawler_temp_dir
- read-timeout - Sets the time in seconds that the crawler waits for a response from the connector framework. The default value is 5 seconds.
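Because a seed URL that matches the urls_to_filter blocklist can cause the Crawler to hang, it can help to test a seed URL against your patterns before you start a crawl. The following Python sketch is illustrative only: the function name and the example patterns are hypothetical, and it assumes that a pattern blocks a URL when it matches anywhere in the URL, which might differ from how the crawler applies its regular expressions.

```python
import re
from typing import Iterable, Optional


def first_blocking_pattern(url: str, patterns: Iterable[str]) -> Optional[str]:
    """Return the first pattern that matches the URL, or None if none match."""
    for pattern in patterns:
        # Assumption: a match anywhere in the URL counts as "filtered".
        if re.search(pattern, url):
            return pattern
    return None


# Hypothetical filters: one blocked domain and one list of file extensions.
filters = [r"^https?://blocked\.example\.com/", r"\.(exe|zip)$"]

print(first_blocking_pattern("https://docs.example.com/guide.html", filters))  # None
print(first_blocking_pattern("https://docs.example.com/setup.exe", filters))
print(first_blocking_pattern("https://blocked.example.com/", []))  # None
```

An empty pattern list never blocks anything, which corresponds to the "allow everything" behavior described above.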
Output adapter
- class - Defines the Data Crawler output adapter class.
- config - Defines which configuration key to pass to the output adapter. The string must correspond to a key within this configuration object. In the following code example, the configuration key is discovery_service:

    class = "com.ibm.watson.crawler.discoveryserviceoutputadapter.DiscoveryServiceOutputAdapter"
    config = "discovery_service"
    discovery_service {
      include "discovery/discovery_service.conf"
    }
You must select an output adapter by specifying its class parameter and config key.
- Discovery Service Output Adapter - Uploads crawled documents to the IBM Watson™ Discovery Service. Select this adapter by setting the class parameter and config key as follows.

    class = "com.ibm.watson.crawler.discoveryserviceoutputadapter.DiscoveryServiceOutputAdapter"
    config = "discovery_service"
    discovery_service {
      include "discovery/discovery_service.conf"
    }
- Test Output Adapter - Writes a representation of the crawled files to disk in a specified location. Select this adapter by setting the class parameter and config key as follows. An additional parameter, output_directory, selects the directory to which the representation of the crawled data is written.

    class = "com.ibm.watson.crawler.testoutputadapter.TestOutputAdapter"
    config = "test"
    output_directory = "/tmp/crawler-test-output"
- retry - Specifies the options for retry in case of failed attempts to push to the output adapter.
  - max_attempts - Maximum number of retry attempts. Default value is 10
  - delay - Minimum amount of delay between attempts, in seconds. Default value is 2
  - exponent_base - Factor that determines the growth of the delay time over each failed attempt. Default value is 2
The formula is:
**`d(nth_retry) = delay * (exponent_base ^ nth_retry)`**
For example, a delay of 1 second with an exponent base of 2 means that the second retry, or the third attempt, is delayed 2 seconds. The third retry is delayed 4 seconds, and so on.
`d(0) = 1 * (2 ^ 0)` = 1 second
`d(1) = 1 * (2 ^ 1)` = 2 seconds
`d(2) = 1 * (2 ^ 2)` = 4 seconds
So, with the default settings, a submission is attempted up to 10 times, waiting up to approximately 1022 seconds, a little more than 17 minutes. This time is approximate because there is additional time added to avoid having multiple resubmissions execute simultaneously. This "fuzzed" time is up to 10%, so the last retry in the previous example could delay up to 7.7 seconds. The wait time does not include the time spent connecting to the service, uploading data, or waiting for a response.
The output_timeout value takes precedence over the wait time here; if the total retry wait time exceeds that setting, the submission fails, even if it should have been retried.
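The retry arithmetic can be checked with a short Python sketch. This is an illustration, not the crawler's implementation: the function name retry_delays is hypothetical, and the sketch assumes that the first attempt is not delayed, so that a max_attempts value of 10 leaves 9 retries. Under that assumption, the stated defaults (delay of 2, exponent_base of 2) reproduce the 1022-second figure. The random "fuzz" is modeled only as a 10% upper bound.

```python
def retry_delays(max_attempts=10, delay=2, exponent_base=2):
    """Base delay before each retry: d(n) = delay * exponent_base ** n."""
    retries = max_attempts - 1  # assumption: the first attempt is not a retry
    return [delay * exponent_base ** n for n in range(retries)]


delays = retry_delays()
total_wait = sum(delays)
print(total_wait)  # 1022

# Upper bound if every delay is lengthened by the maximum 10% "fuzz".
worst_case_wait = total_wait * 1.10

# Compare with the documented output_timeout default of 1200 seconds.
output_timeout = 1200
print(worst_case_wait < output_timeout)  # True
```

With these defaults, the total retry wait stays below the default output_timeout, so the timeout would cut retries short only if you raise the retry settings or lower output_timeout.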
Additional crawl management options
- full_node_debugging - Activates debugging mode; possible values are true or false. This puts the full data of every document crawled into the logs.
- logging.log4j.configuration_file - The configuration file to use for logging. In the sample crawler.conf file, this option is defined in logging.log4j and its default value is log4j_custom.properties. This option must be similarly defined whether using a .properties or .conf file.
- shutdown_timeout - Specifies the timeout value, in minutes, before shutting down the application. Default value is 10.
- output_limit - The highest number of indexable items that the Crawler attempts to send simultaneously to the output adapter. This can be further limited by the number of cores available to do the work. That is, at any given point, no more than "x" indexable items have been sent to the output adapter and are waiting to return. Default value is 10.
- input_limit - Limits the number of URLs that can be requested from the input adapter at one time. Default value is 30.
- output_timeout - The amount of time, in seconds, before the Data Crawler gives up on a request to the output adapter, and then removes the item from the output adapter queue to allow more processing. Default value is 1200.
  Consider the constraints imposed by the output adapter, as those constraints might relate to the limits defined here. The defined output_limit only relates to how many indexable objects can be sent to the output adapter at once. Once an indexable object is sent to the output adapter, it is "on the clock," as defined by the output_timeout variable. It is possible that the output adapter itself has a throttle preventing it from being able to process as many inputs as it receives. For instance, the orchestration output adapter might have a connection pool, configurable for HTTP connections to the service. If it defaults to 8, for example, and if you set the output_limit to a number greater than 8, then there are processes, on the clock, waiting for a turn to execute. You might then experience timeouts.
- num_threads - The number of parallel threads that can be run at one time. This value can be either an integer, which specifies the number of parallel threads directly, or it can be a string, with the format "xNUM", specifying the multiplication factor of the number of available processors, for example, "x1.5". The default value is "30"
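The two forms of num_threads can be illustrated with a short Python sketch. This is not the crawler's own parsing logic: the function name resolve_num_threads is hypothetical, and rounding down with a minimum of one thread is an assumption, because the rounding rule for fractional results is not stated here.

```python
import math
import os
from typing import Optional, Union


def resolve_num_threads(value: Union[int, str], processors: Optional[int] = None) -> int:
    """Turn a num_threads setting into a thread count."""
    if processors is None:
        # os.cpu_count() can return None, so fall back to a single processor.
        processors = os.cpu_count() or 1
    text = str(value).strip()
    if text.lower().startswith("x"):
        # "xNUM" form: NUM is a multiplier of the available processors.
        factor = float(text[1:])
        # Assumption: round down, but never go below one thread.
        return max(1, math.floor(factor * processors))
    # Plain form: the value is the thread count itself.
    return int(text)


print(resolve_num_threads("30"))                  # 30
print(resolve_num_threads("x1.5", processors=4))  # 6
```

Passing processors explicitly, as in the second call, makes the result reproducible; leaving it out uses the processor count of the machine that runs the sketch.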
Configuring service options
The Discovery service options tell the crawler how to manage crawled files when using IBM Watson™ Discovery.
To access the in-product manual for the discovery_service.conf file, with the most up-to-date information, type the following command from the Crawler installation directory:
man discovery_service.conf
Default options can be changed directly by opening the config/discovery/discovery_service.conf file, and specifying the following values specific to your use case:
- http_timeout - The timeout, in seconds, for the document read/index operation; the default is 125.
- proxy_host_port - (optional) When running the data crawler behind a firewall, you might need to set the proxy hostname and proxy port number in order for the data crawler to talk to Discovery. The default value for this option is an empty string, and if you need to change it, enter the value in the form of "{host}:{port}".
- concurrent_upload_connection_limit - The number of simultaneous connections allowed for uploading documents. The default is 2. When using the Orchestration Service Output Adapter, this number must be greater than, or equal to, the output_limit set when configuring crawl options.
- base_url - The URL to which your crawled documents are sent.
- environment_id - The location of your crawled document collection at the base URL.
- collection_id - Name of the document collection that you set up in Discovery.
- api_version - Internal use only. Date of the last API version change.
- configuration_id - The filename of the configuration file that Discovery uses.
- apikey - Credential to authenticate to the location of your crawled document collection.
The Discovery Service Output Adapter can send statistics in order for IBM® to better understand and serve its users. The following options can be set for the send_stats variable:
- jvm - Java Virtual Machine (JVM) statistics sent include the Java vendor and version, as reported by the JVM used to execute the data crawler. Value is either true or false. Default value is true.
- os - Operating system (OS) statistics sent include OS name, version, and architecture, as reported by the JVM used to execute the data crawler. Value is either true or false. Default value is true.