IBM Cloud Docs
Building a Cloud Pak for Data custom crawler plug-in

Discovery features the option to build your own crawler plug-in with a Java SDK. By using crawler plug-ins, you can quickly develop solutions that are relevant to your use cases. You can download the SDK from your installed Discovery cluster. For more information, see Obtaining the crawler plug-in SDK package.

IBM Cloud Pak for Data only

This information applies only to installed deployments.

Any custom code that you use with IBM Watson® Discovery is the responsibility of the developer; IBM Support does not cover any custom code that the developer creates.

The crawler plug-ins support the following functions:

  • Update the metadata list of a crawled document
  • Update the content of a crawled document
  • Exclude a crawled document
  • Reference crawler configurations, masking password values
  • Show notice messages in the Discovery user interface
  • Output log messages to the crawler pod console

However, the crawler plug-ins cannot support the following functions:

  • Split a crawled document into multiple documents
  • Combine content from multiple documents into a single document
  • Modify access control lists

Crawler plug-in requirements

Make sure that the following software is installed on the development server where you plan to build a crawler plug-in with this SDK:

  • Java SE Development Kit (JDK) 1.8 or higher
  • Gradle
  • cURL
  • sed (stream editor)

Obtaining the crawler plug-in SDK package

  1. Log in to your Discovery cluster.

  2. Enter the following command to obtain your crawler pod name:

    oc get pods | grep crawler
    

    The following example shows sample output.

    wd-discovery-crawler-57985fc5cf-rxk89     1/1     Running     0          85m
    
  3. Enter the following command to obtain the SDK package name, replacing {crawler-pod-name} with the crawler pod name that you obtained in step 2:

    oc exec {crawler-pod-name} -- ls -l /opt/ibm/wex/zing/resources/ | grep wd-crawler-plugin-sdk
    

    The following example shows sample output.

    -rw-r--r--. 1 dadmin dadmin 35575 Oct  1 16:51 wd-crawler-plugin-sdk-${build-version}.zip
    
  4. Enter the following command to copy the SDK package to the host server, replacing ${build-version} with the build version number from the previous step:

    oc cp {crawler-pod-name}:/opt/ibm/wex/zing/resources/wd-crawler-plugin-sdk-${build-version}.zip wd-crawler-plugin-sdk.zip
    
  5. If necessary, copy the SDK package to the development server.
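
The steps above can also be run as one short script. The sketch below is an assumption-laden convenience, not part of the official procedure: it assumes `oc` is already logged in to the cluster, that exactly one crawler pod matches `grep crawler`, and that the pod name appears in the first column of `oc get pods` output, as in the sample above.

```shell
# Sketch: fetch the crawler plug-in SDK in one pass (assumes a logged-in `oc`
# session and a single matching crawler pod).
pod_name=$(oc get pods | grep crawler | awk '{print $1}')
sdk_file=$(oc exec "$pod_name" -- ls /opt/ibm/wex/zing/resources/ | grep wd-crawler-plugin-sdk)
oc cp "$pod_name:/opt/ibm/wex/zing/resources/$sdk_file" wd-crawler-plugin-sdk.zip
```

If more than one crawler pod is running, pick the pod name manually instead of relying on the `grep`/`awk` extraction.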

Building a crawler plug-in package

  1. Extract the SDK compressed file.
  2. Implement the plug-in logic in src/. Ensure that any dependencies are declared in build.gradle.
  3. Enter gradle packageCrawlerPlugin to create the plug-in package. The package is generated as build/distributions/wd-crawler-plugin-sample.zip.
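
In a terminal, the build steps look roughly like the following sketch. The extraction directory name is arbitrary, gradle is assumed to be on PATH, and the package path follows Gradle's standard distributions layout.

```shell
# Sketch: unpack the SDK, implement the plug-in, then package it.
unzip wd-crawler-plugin-sdk.zip -d wd-crawler-plugin-sdk
cd wd-crawler-plugin-sdk
# ... edit the plug-in logic under src/ and declare dependencies in build.gradle ...
gradle packageCrawlerPlugin
ls build/distributions/
```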