External enrichment API
The external enrichment feature is not supported in the Analyze API.
The external enrichment feature allows you to annotate documents with a model of your choice. Through a webhook interface, you can use custom models, advanced foundation models, or other third-party models to enrich the documents in a collection. Your external application enriches the documents, which are then merged into a collection in a Discovery project.
IBM Cloud Pak for Data: When you run Discovery in an air-gapped environment, you must connect to the external application through an HTTP proxy. For more information, see Setting up HTTP proxy in air-gapped environments.
To use the external enrichment feature, complete the following steps:
- Set up an external application that can receive webhook notifications from Discovery and annotate documents.
  To do so, register your external application as a webhook endpoint on a project by using the create enrichment method. For more information, see Create enrichment in the API reference.
  After you set up the external enrichment for a project, it becomes available to all collections in the project. The external application also receives a webhook ping event, which notifies it that an external enrichment was created.
- Specify the collection to which you want to apply the external enrichment. You can use the API to apply the external enrichment to a collection. For more information, see Using the API to manage enrichments.
  Alternatively, in the user interface, you can browse to the Manage collections page and choose the collection to which you want to apply the external enrichment. Then, open the Enrichments tab and apply your external enrichment to a field in the collection.
  When documents are processed or uploaded to this collection, Discovery creates a batch of documents with a unique batch_id. The external application also receives a webhook enrichment.batch.created event, which notifies it that batches are ready to be pulled. Your external application can then pull batches from Discovery for external enrichment.
  If the external application shuts down or restarts in the meantime, you can use the List batches method to get the following batches:
  - Notified batches that are not yet pulled by the external enrichment application.
  - Batches that are pulled, but not yet pushed to Discovery by the external enrichment application.
  For more information, see List batches in the API reference.
- Specify the batch_id provided by Discovery in the pull batches method to pull the documents from Discovery for enrichment by your external application. For more information, see Pull batches in the API reference.
  The pull batches method returns a binary file attachment from Discovery. For more information about the binary attachment, see Binary attachment from the pull batches method.
- Specify the same batch_id in the push batches method after your external enrichment annotates the documents in the batch. For more information, see Push batches in the API reference.
  The documents are pushed to Discovery as a binary attachment. For more information, see Binary attachment in the push batches method.
- Verify that the documents are merged and indexed in the collection. The documents must contain the annotations that are applied by your external application.
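The pull, enrich, and push cycle described in the steps above can be sketched as a small orchestration loop in Node.js. This is a minimal sketch and not the Discovery client API: the client object and its listBatches, pullBatch, and pushBatch methods, as well as the enrich function, are hypothetical names for code that you supply, for example thin wrappers around the List batches, Pull batches, and Push batches methods.

```javascript
// Hypothetical orchestration of the pull -> enrich -> push cycle.
// `client` is an object that you implement; its method names are
// assumptions for this sketch, not part of the Discovery API.
async function processBatch(batchId, client, enrich) {
  // Pull the documents of this batch by using its batch_id.
  const documents = await client.pullBatch(batchId);
  // Annotate each document with your own model.
  const enriched = [];
  for (const doc of documents) {
    enriched.push(await enrich(doc));
  }
  // Push the annotated documents back with the same batch_id.
  await client.pushBatch(batchId, enriched);
  return enriched.length;
}

// After a restart, pick up batches that were notified or pulled
// but not yet pushed. Assumes listBatches resolves to an array of batch IDs.
async function recoverPendingBatches(client, enrich) {
  const pendingBatchIds = await client.listBatches();
  let total = 0;
  for (const batchId of pendingBatchIds) {
    total += await processBatch(batchId, client, enrich);
  }
  return total;
}
```

Processing batches one at a time keeps the sketch simple; whether batches can be processed concurrently depends on your external application.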
Webhook security
To authenticate the webhook request, verify the JSON Web Token (JWT) that is sent with the request. The webhook microservice automatically generates a JWT and sends it in the Authorization header with each webhook call. It is your responsibility to add code to the external service that verifies the JWT.
The system generates the JWT based on the secret that you specify, such as sample secret, and passes this system-generated JWT to the external application in the Authorization header. If you specify a value in the header, then the webhook microservice sends that value to the external application instead of the JWT.
For example, if you specify sample secret in the Secret field of the Webhooks object in the Create collection or update collection APIs, you might add sample code such as the following in Node.js:
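const jwt = require('jsonwebtoken');
...
// Node.js lowercases incoming header names; the JWT is sent in the Authorization header.
const header = request.headers.authorization || '';
// Assumption: drop a leading "Bearer " scheme if one is present.
const token = header.replace(/^Bearer\s+/i, '');
try {
  // Throws if the signature does not match the secret or the token is malformed or expired.
  const decoded = jwt.verify(token, 'sample secret');
} catch (err) {
  // The token is invalid: reject the request here, for example with a 401 response.
}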
Data model of the ping event
The ping event has the following parameters:
Parameter | Description |
---|---|
event | The event name is ping. |
instance_id | The Discovery instance ID. |
version | The Discovery API version in the format yyyy-mm-dd. |
data | An object with the event information. |
created_at | The date and time the event was created. |
Data model of the enrichment.batch.created event
The enrichment.batch.created event has the following parameters:
Parameter | Description |
---|---|
event | The event name is enrichment.batch.created. |
instance_id | The UUID of the Discovery instance, which is also known as the tenant ID. |
version | The webhook event version date in the yyyy-mm-dd format. |
data | An object with the event-specific information. |
created_at | The date and time the event was created. |
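A webhook receiver typically branches on the event parameter of the two data models above. The following is a minimal sketch of such a dispatcher, assuming that the request body is already parsed into an object. The handler names onPing and onBatchCreated are hypothetical, and the contents of data are passed through without interpretation.

```javascript
// Route a parsed webhook payload to a handler by its event name.
// The handler names (onPing, onBatchCreated) are hypothetical.
function handleWebhookEvent(payload, handlers) {
  if (!payload || typeof payload.event !== 'string') {
    throw new Error('Webhook payload has no event name');
  }
  switch (payload.event) {
    case 'ping':
      // Sent when an external enrichment is created.
      return handlers.onPing(payload);
    case 'enrichment.batch.created':
      // payload.data carries the event-specific information.
      return handlers.onBatchCreated(payload.data, payload);
    default:
      // Ignore unknown events so that new event types do not break the receiver.
      return undefined;
  }
}
```

Returning undefined for unknown events is a design choice of this sketch; you might prefer to log them instead.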
External enrichment limits
Plan | Maximum number of webhook enrichments per collection | Maximum number of webhook enrichments per tenant |
---|---|---|
Enterprise | 1 | 100 |
Plus | 1 | 10 |
Premium | 1 | 100 |
Binary attachment from the pull batches method
The pull batches method returns a binary attachment file from Discovery.
The returned file is a compressed newline-delimited JSON (NDJSON) file. This file contains structured data that represents the document properties. For example, the following is a JSON value included in the NDJSON file:
{
"document_id": "3bafc09abfaacd90d66f57181b50d041",
"location_encoding": "utf-16",
"language": "en",
"artifact": "{\"text_positions\":[0,21],\"space_above\":93.07864284515381,\"space_below\":32.53530788421631,\"is_start_of_block\":true,\"image_id\":-1}{\"text_positions\":[22,63],\"space_above\":32.53530788421631,\"space_below\":13.935576438903809,\"is_start_of_block\":true,\"image_id\":-1}{\"parent_document_id\":\"3bafc09abfaacd90d66f57181b50d041\",\"source\":{\"ListId\":\"f0ac1d32-b9e5-41af-b9da-e1e37e965d99\",\"UniqueId\":\"357d7a48-4460-442c-be56-d8bdd40a8c36\",\"ServerRelativeUrl\":\"/Lists/list1/Attachments/1/addattachments.csv\",\"FileNameAsPath\":{\"DecodedUrl\":\"addattachments.csv\"},\"ListItemId\":\"284dcb51-8021-56d0-9213-7f4eb134e083\",\"FileName\":\"addattachments.csv\",\"ServerRelativePath\":{\"DecodedUrl\":\"/Lists/list1/Attachments/1/addattachments.csv\"},\"WebId\":\"ad5bf592-3b4e-4dd1-bd3e-abc0ef179b03\"},\"ingest_datetime\":\"2023-06-26T09:24:02.573Z\",\"application_id\":\"sharepoint\",\"application_sub_type\":\"ListItemAttachmentCollection\"}0.51vanilla ice creamcontamination_tamperingotherchange_of_propertiesI love the ads for the new milk chocolate. Could you tell me the name of the actor in the commercial?{\"metadata\":{\"numPages\":\"54\",\"title\":\"\",\"publicationdate\":\"2010-06-03\"},\"info\":{\"histogram\":{\"mean-char-height\":{},\"mean-char-width\":{},\"number-of-chars\":{}},\"styles\":[]}}1451692800000",
"features": [
{
"type": "field",
"location": {
"begin": 0,
"end": 128
},
"properties": {
"field_name": "multi_nested",
"field_index": 0,
"field_type": "json"
}
},
{
"type": "field",
"location": {
"begin": 128,
"end": 258
},
"properties": {
"field_name": "multi_nested",
"field_index": 1,
"field_type": "json"
}
},
{
"type": "field",
"location": {
"begin": 258,
"end": 889
},
"properties": {
"field_name": "metadata",
"field_index": 0,
"field_type": "json"
}
},
{
"type": "field",
"location": {
"begin": 889,
"end": 892
},
"properties": {
"field_name": "claim_score",
"field_index": 0,
"field_type": "double"
}
},
{
"type": "field",
"location": {
"begin": 892,
"end": 893
},
"properties": {
"field_name": "claim_id",
"field_index": 0,
"field_type": "long"
}
},
{
"type": "field",
"location": {
"begin": 893,
"end": 910
},
"properties": {
"field_name": "claim_product",
"field_index": 0,
"field_type": "string"
}
},
{
"type": "field",
"location": {
"begin": 910,
"end": 933
},
"properties": {
"field_name": "label",
"field_index": 0,
"field_type": "string"
}
},
{
"type": "field",
"location": {
"begin": 933,
"end": 938
},
"properties": {
"field_name": "label",
"field_index": 1,
"field_type": "string"
}
},
{
"type": "field",
"location": {
"begin": 938,
"end": 958
},
"properties": {
"field_name": "label",
"field_index": 2,
"field_type": "string"
}
},
{
"type": "field",
"location": {
"begin": 958,
"end": 1059
},
"properties": {
"field_name": "body",
"field_index": 0,
"field_type": "string"
}
},
{
"type": "field",
"location": {
"begin": 1059,
"end": 1230
},
"properties": {
"field_name": "nested",
"field_index": 0,
"field_type": "json"
}
},
{
"type": "field",
"location": {
"begin": 1230,
"end": 1243
},
"properties": {
"field_name": "claim_date",
"field_index": 0,
"field_type": "date"
}
}
]
}
The binary file has the following properties:
Property | Type | Description |
---|---|---|
document_id | string | The identifier of the document. |
location_encoding | string | The encoding type used to calculate the location of each feature. The supported types are: utf-8, utf-16, and utf-32. The external enrichment application must calculate the location of each feature based on the location_encoding of the corresponding document from Discovery. The location of features in a string representation of data varies depending on the encoding type of the programming language that is used for implementing the external enrichment. For example, C++ and Go use UTF-8, Java and JavaScript use UTF-16, and Python uses UTF-32. |
language | string | The content language of the document. |
artifact | string | The package of all the text values. |
features | array | The list of features in a document. For more information, see Feature types. |
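The following Node.js sketch shows one way to read the pulled attachment and recover the field values from the artifact. It rests on two assumptions that are not stated in this topic: that the attachment is compressed with gzip, and that you already have the attachment as a Buffer. The function names parsePulledBatch and extractFields are hypothetical.

```javascript
const { gunzipSync } = require('zlib');

// Decompress the attachment and parse one JSON value per line.
// Assumption: gzip compression; this topic only says "compressed".
function parsePulledBatch(compressedBuffer) {
  const ndjson = gunzipSync(compressedBuffer).toString('utf8');
  return ndjson
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

// Collect the text of every field feature, grouped by field_name
// and positioned by field_index.
// String.prototype.slice counts UTF-16 code units, so this function
// is only valid when location_encoding is utf-16.
function extractFields(doc) {
  if (doc.location_encoding !== 'utf-16') {
    throw new Error(`Unsupported location_encoding: ${doc.location_encoding}`);
  }
  const fields = {};
  for (const feature of doc.features) {
    if (feature.type !== 'field') continue;
    const { begin, end } = feature.location;
    const { field_name: name, field_index: index } = feature.properties;
    if (!fields[name]) fields[name] = [];
    // end is exclusive, which matches the semantics of slice.
    fields[name][index] = doc.artifact.slice(begin, end);
  }
  return fields;
}
```

For the example document above, extractFields would return an object whose body entry holds the text between offsets 958 and 1059 of the artifact.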
Binary attachment in the push batches method
After external enrichment, the documents can be pushed to Discovery as a binary attachment in the push batches method.
The file must be a compressed NDJSON file with structured data that represents the document properties. For example, the following is a JSON value in the NDJSON file. It is shown on multiple lines for readability; in an NDJSON file, each JSON value occupies a single line:
{
"document_id": "3bafc09abfaacd90d66f57181b50d041",
"features": [
{
"type": "annotation",
"location": {
"begin": 958,
"end": 1000
},
"properties": {
"type": "element_classes",
"class_name": "expression",
"confidence": 0.7905777096748352
}
},
{
"type": "annotation",
"location": {
"begin": 1001,
"end": 1059
},
"properties": {
"type": "element_classes",
"class_name": "question",
"confidence": 0.9507029056549072
}
},
{
"type": "annotation",
"location": {
"begin": 1035,
"end": 1040
},
"properties": {
"type": "entities",
"entity_type": "JobTitle",
"entity_text": "actor",
"confidence": 0.70953685
}
},
{
"type": "annotation",
"properties": {
"type": "document_classes",
"class_name": "amount.shortage",
"confidence": 0.43297016620635986
}
},
{
"type": "notice",
"properties": {
"description": "something wrong happened"
}
},
{
"type": "notice",
"properties": {
"description": "something wrong happened again",
"created": 1689076276402
}
}
]
}
The binary file has the following properties:
Property | Type | Description |
---|---|---|
document_id | string | The identifier of the document. |
features | array | The list of features in a document. For more information, see Feature types. |
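The following Node.js sketch builds such an attachment from enriched documents. It assumes gzip as the compression format, which this topic does not state, and the helper names annotation and buildPushAttachment are hypothetical.

```javascript
const { gzipSync } = require('zlib');

// Build an annotation feature. Omit the location argument for
// document-level annotations such as document_classes.
function annotation(properties, location) {
  const feature = { type: 'annotation', properties };
  if (location) feature.location = location;
  return feature;
}

// Serialize enriched documents as NDJSON (one JSON value per line)
// and compress the result.
// Assumption: gzip compression; this topic only says "compressed".
function buildPushAttachment(documents) {
  const ndjson = documents
    .map((doc) => JSON.stringify({ document_id: doc.document_id, features: doc.features }))
    .join('\n') + '\n';
  return gzipSync(Buffer.from(ndjson, 'utf8'));
}
```

For example, annotation({ type: 'entities', entity_type: 'JobTitle', entity_text: 'actor', confidence: 0.7 }, { begin: 1035, end: 1040 }) produces a feature with the same shape as the entities annotation in the example above.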
Feature types
In a binary file, the type of a feature can be one of the following values:
Feature | Type | Description |
---|---|---|
field | string | Represents a specific field value of the document. |
annotation | string | Represents a specific annotation that can enrich the document. |
notice | string | Represents any error that might occur in the external application during document enrichment. The information in notice is used to generate a message on the Discovery UI. |
The following are the other properties of a feature in the binary file:
Property | Type | Description |
---|---|---|
location | object | Location information to get the text value from the artifact by using the begin and end values. The begin value is the offset of the begin location in the artifact. The end value is the offset of the exclusive end location in the artifact. Both offsets are JSON numbers, as in the examples in this topic. This property is null when a feature represents document-level information, for example, when type=annotation and properties.type=document_classes. |
properties | object | The properties of a feature in the document. Supported properties vary depending on the type of feature. For more information, see Field type properties, Annotation type properties, and Notice type properties. |
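Because JavaScript strings are indexed in UTF-16 code units, begin and end can be applied directly only when location_encoding is utf-16. The following sketch shows one way to handle all three encodings. It assumes that offsets count code units of the named encoding (bytes for utf-8, code points for utf-32); the function name sliceByLocation is hypothetical.

```javascript
// Return the text that a location refers to, or null for a
// document-level feature that has no location.
// Assumption: offsets count code units of the given encoding.
function sliceByLocation(artifact, location, encoding) {
  if (!location) return null;
  const { begin, end } = location; // end is exclusive
  switch (encoding) {
    case 'utf-16':
      // JavaScript strings are sequences of UTF-16 code units.
      return artifact.slice(begin, end);
    case 'utf-32':
      // Array.from splits a string into code points.
      return Array.from(artifact).slice(begin, end).join('');
    case 'utf-8':
      // Offsets are byte positions in the UTF-8 encoding of the artifact.
      return Buffer.from(artifact, 'utf8').subarray(begin, end).toString('utf8');
    default:
      throw new Error(`Unsupported location_encoding: ${encoding}`);
  }
}
```

Converting the whole artifact for every call is wasteful for large documents; a production implementation would convert once per document and reuse the result.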
Field type properties
For the field type, the following properties represent a certain field of the document that was converted by Discovery from an original file:
Property | Type | Description |
---|---|---|
field_name | string | The name of the field. |
field_index | int | The index of a field value. This value is 0 for a single-valued field, and can be greater than 0 when a field is multi-valued, such as an array of values. |
field_type | string (enum: string, long, double, date, json) | The data type of the feature. This value determines how to parse the text representation of the feature in a programming language. The string value is not in the original enumeration but is used in the pull batches example. |
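The following sketch shows how field_type might drive the parsing of a text value that is sliced from the artifact. Two points are assumptions drawn only from the pull batches example: that string is a valid field_type, and that a date value is a Unix epoch time in milliseconds (as in the claim_date value 1451692800000). The function name parseFieldValue is hypothetical.

```javascript
// Convert the text representation of a field value to a JavaScript value.
function parseFieldValue(text, fieldType) {
  switch (fieldType) {
    case 'long':
      // BigInt keeps 64-bit integers exact; Number loses precision above 2^53 - 1.
      return BigInt(text);
    case 'double':
      return Number(text);
    case 'date':
      // Assumption: epoch milliseconds, as in the claim_date example value.
      return new Date(Number(text));
    case 'json':
      return JSON.parse(text);
    case 'string':
      return text;
    default:
      throw new Error(`Unknown field_type: ${fieldType}`);
  }
}
```

If your long values always fit in the safe integer range, returning Number(text) instead of a BigInt is simpler to work with.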
Annotation type properties
For the annotation type, the following properties represent an annotation that can enrich a document:
Property | Type | Description |
---|---|---|
type | string (enum: entities, element_classes, document_classes) | The type of enriched annotation that a feature represents. The entities are merged into the entities of enriched fields. The element_classes are merged into the element classes of enriched fields. The document_classes are merged into the classes of the document-level enrichment field. |
confidence | double | The optional confidence score that is assigned by the external model. It is between 0 and 1, and is 0 by default. |
entity_type | string | The type of entity that an external model assigns to a thing. Required for the entities type. |
entity_text | string | The representative text of an entity that the external application extracts. Required for the entities type. |
class_name | string | The name of a class that the external application assigns to a thing. Required for the element_classes and document_classes types. |
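Before you push a batch, it can help to check each annotation against the requirements in the table above. The following is a minimal sketch of such a check; the function name validateAnnotation and the wording of the messages are hypothetical, and the check covers only the rules that are stated in the table.

```javascript
// Return a list of problems found in an annotation feature.
// An empty list means that the feature satisfies the stated rules.
function validateAnnotation(feature) {
  const errors = [];
  const props = feature.properties || {};
  // Required properties for each annotation type.
  const required = {
    entities: ['entity_type', 'entity_text'],
    element_classes: ['class_name'],
    document_classes: ['class_name'],
  };
  if (!Object.prototype.hasOwnProperty.call(required, props.type)) {
    errors.push(`unsupported annotation type: ${props.type}`);
    return errors;
  }
  for (const name of required[props.type]) {
    if (typeof props[name] !== 'string' || props[name] === '') {
      errors.push(`${props.type} annotation requires ${name}`);
    }
  }
  // confidence is optional, but must be in the range 0 to 1 when present.
  if (props.confidence !== undefined &&
      !(typeof props.confidence === 'number' && props.confidence >= 0 && props.confidence <= 1)) {
    errors.push('confidence must be a number between 0 and 1');
  }
  return errors;
}
```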
Notice type properties
For the notice type, the following properties represent errors and exceptions that occurred in the external application while enriching a document:
Property | Type | Description |
---|---|---|
description | string | The message that describes an error that occurred during external enrichment. |
created | long | Unix epoch time in milliseconds when an error occurred during external enrichment. |
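One way to produce notice features is to wrap the enrichment of each document so that a failure is reported instead of aborting the batch. The following is a minimal sketch under that assumption; enrichSafely and the enrich callback, which is expected to return an array of features, are hypothetical names.

```javascript
// Run the enrichment for one document. If it fails, return a notice
// feature that describes the error instead of throwing.
async function enrichSafely(doc, enrich) {
  try {
    const features = await enrich(doc);
    return { document_id: doc.document_id, features };
  } catch (err) {
    return {
      document_id: doc.document_id,
      features: [{
        type: 'notice',
        properties: {
          description: String(err && err.message ? err.message : err),
          // Unix epoch time in milliseconds.
          created: Date.now(),
        },
      }],
    };
  }
}
```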