In an upcoming release, the bundled JVM for the crawler plug-in and custom connector features will be transitioned to IBM Semeru Runtimes, Version 21. If your crawler plug-in or custom connectors use any features that are incompatible between IBM SDK, Java Technology Edition, Version 8 and IBM Semeru Runtimes, Version 21, revise your code so that it remains compatible with future releases (Discovery 5.2.x and later).
For more information about migrating between these JVMs, see the following pages:
- https://www.ibm.com/support/pages/semeru-runtimes-migration-guide
- https://www.ibm.com/support/pages/semeru-runtimes-security-migration-guide
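One common example of such an incompatibility: the javax.xml.bind (JAXB) APIs shipped with Java 8 but were removed from the JDK in Java 11 (JEP 320), so they are not available in Version 21. The following sketch is a generic illustration of that migration pattern, not code from the Discovery SDK:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Java 8 code often used javax.xml.bind.DatatypeConverter, which was removed
// from the JDK in Java 11 (JEP 320) and is therefore absent in Semeru 21:
//
//   import javax.xml.bind.DatatypeConverter;
//   String encoded = DatatypeConverter.printBase64Binary(bytes);
//
// java.util.Base64, available since Java 8, works on both runtimes.
public class EncodingMigration {
    public static String encode(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        System.out.println(encode("crawler-plugin".getBytes(StandardCharsets.UTF_8)));
        // prints: Y3Jhd2xlci1wbHVnaW4=
    }
}
```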
Building a custom crawler plug-in
Discovery provides a Java SDK that you can use to build your own crawler plug-in. By using crawler plug-ins, you can quickly develop solutions for your use cases. You can download the SDK from your installed Discovery cluster. For more information, see Obtaining the crawler plug-in SDK package.
This information applies only to installed deployments (IBM Cloud Pak for Data and IBM Software Hub).
Any custom code that you use with IBM Watson® Discovery is the responsibility of the developer; IBM Support does not cover any custom code that the developer creates.
The crawler plug-ins support the following functions (a brief sketch follows these two lists):
- Update the metadata list of a crawled document
- Update the content of a crawled document
- Exclude a crawled document
- Reference crawler configurations, masking password values
- Show notice messages in the Discovery user interface
- Output log messages to the crawler pod console
However, the crawler plug-ins cannot support the following functions:
- Split a crawled document into multiple documents
- Combine content from multiple documents into a single document
- Modify access control lists
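The actual plug-in interfaces are defined by the SDK that you download (see Obtaining the crawler plug-in SDK package). As a rough illustration only, the following Java sketch shows the shape of two supported operations, updating the metadata list of a document and excluding a document; every name in it (SampleCrawlerPlugin, Document, process) is a hypothetical stand-in, not the SDK's real API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the class and method names are hypothetical
// stand-ins so that the example compiles on its own. Use the real
// interfaces that ship in the SDK package.
public class SampleCrawlerPlugin {

    // Minimal placeholder for whatever crawled-document type the SDK exposes.
    static class Document {
        final Map<String, String> metadata = new HashMap<>();
        String contentType;
        boolean excluded;
    }

    // Update the metadata list of a crawled document, or exclude it.
    Document process(Document doc) {
        if (doc.contentType != null && doc.contentType.startsWith("image/")) {
            doc.excluded = true; // exclude a crawled document
            return doc;
        }
        doc.metadata.put("crawled_by", "sample-plugin"); // update the metadata list
        return doc;
    }

    public static void main(String[] args) {
        Document doc = new Document();
        doc.contentType = "text/html";
        System.out.println(new SampleCrawlerPlugin().process(doc).metadata);
        // prints: {crawled_by=sample-plugin}
    }
}
```

Note that, consistent with the second list, a plug-in of this shape transforms or drops a single document; it has no way to emit extra documents or merge several into one.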
Crawler plug-in requirements
Make sure that the following items are installed on the development server that you plan to use to develop a crawler plug-in by using this SDK:
- Java SE Development Kit (JDK) 1.8 or higher
- Gradle
- cURL
- sed (stream editor)
Obtaining the crawler plug-in SDK package
1. Log in to your Discovery cluster.

2. Enter the following command to obtain your crawler pod name:

   ```
   oc get pods | grep crawler
   ```

   The following example shows sample output.

   ```
   wd-discovery-crawler-57985fc5cf-rxk89 1/1 Running 0 85m
   ```

3. Enter the following command to obtain the SDK package name, replacing `{crawler-pod-name}` with the crawler pod name that you obtained in step 2:

   ```
   oc exec {crawler-pod-name} -- ls -l /opt/ibm/wex/zing/resources/ | grep wd-crawler-plugin-sdk
   ```

   The following example shows sample output.

   ```
   -rw-r--r--. 1 dadmin dadmin 35575 Oct 1 16:51 wd-crawler-plugin-sdk-${build-version}.zip
   ```

4. Enter the following command to copy the SDK package to the host server, replacing `{crawler-pod-name}` and `{build-version}` with the values that you obtained in the previous steps:

   ```
   oc cp {crawler-pod-name}:/opt/ibm/wex/zing/resources/wd-crawler-plugin-sdk-${build-version}.zip wd-crawler-plugin-sdk.zip
   ```

5. If necessary, copy the SDK package to the development server.
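Put together, steps 2 through 4 look like the following transcript. The pod name comes from the sample output above, and the build version 14.1.0 is a made-up value for illustration; substitute the values from your own cluster:

```
# Step 2: find the crawler pod (name below is from the sample output)
oc get pods | grep crawler
# wd-discovery-crawler-57985fc5cf-rxk89   1/1   Running   0   85m

# Step 3: list the SDK package inside the pod (version 14.1.0 is illustrative)
oc exec wd-discovery-crawler-57985fc5cf-rxk89 -- ls -l /opt/ibm/wex/zing/resources/ | grep wd-crawler-plugin-sdk
# -rw-r--r--. 1 dadmin dadmin 35575 Oct 1 16:51 wd-crawler-plugin-sdk-14.1.0.zip

# Step 4: copy the package to the host server
oc cp wd-discovery-crawler-57985fc5cf-rxk89:/opt/ibm/wex/zing/resources/wd-crawler-plugin-sdk-14.1.0.zip wd-crawler-plugin-sdk.zip
```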
Building a crawler plug-in package
1. Extract the SDK compressed file.
2. Implement the plug-in logic in `src/`. Ensure that any dependencies are declared in `build.gradle`.
3. Enter `gradle packageCrawlerPlugin` to create the plug-in package. The package is generated as `build/distributions/wd-crawler-plugin-sample.zip`.