Configuration reference
You can create your own Discovery ingestion configuration in JSON if your data has special conversion, enrichment, or normalization needs.
The following sections detail the structure of this JSON and the objects that can be defined in it. Also see the Configurations section of the API reference.
If you configure your collection using Smart Document Understanding, the PDF and Word conversion settings listed are not used, and changes to these conversion settings are ignored.
Configuration structure
A Discovery configuration is structured as follows:
{
"name": "Configuration Name",
"description": "Descriptive text about the configuration",
"conversions": {
"word": {},
"pdf": {},
"html": {},
"segment": {},
"json_normalizations": []
},
"enrichments": [],
"normalizations": []
}
The base JSON object contains the following items:
- "name": "Configuration Name" - The name of your configuration.
- "description": "Descriptive text about the configuration" - A description of your configuration.
The following objects and arrays must be defined to convert, enrich, and normalize documents that are uploaded to your collection.
- "conversions": {} - How documents are transformed into JSON that can be enriched.
- "enrichments": [] - Which enrichments are applied to which parts of the JSON.
- "normalizations": [] - Any post-enrichment adjustments that are required before the document is stored.
Additionally, the following items are added to the base object by Discovery when the configuration is created/updated:
{
"configuration_id": "4f5b7c7b-ebf4-4963-882e-27eff08f08e3",
"created": "2017-09-13T14:45:03.575Z",
"updated": "2017-09-13T14:45:03.575Z"
}
Conversion
Document conversion takes the original source format and, in one or more steps, transforms it into JSON that is used for the rest of the ingestion process. Depending on the type of file uploaded, the process is as follows:
- PDF files are converted to HTML using the pdf options, the resulting HTML is then converted to JSON using the html options, and finally the resulting JSON is converted using the json options.
- Microsoft Word files are converted to HTML using the word options, the resulting HTML is then converted to JSON using the html options, and finally the resulting JSON is converted using the json options.
- HTML files are converted to JSON using the html options, and the resulting JSON is converted using the json options.
- JSON files are converted using the json options.
These options are described in the following sections. After conversion completes, enrichment and normalization are performed before the content is stored.
PDF
If you configure your collection using Smart Document Understanding, the PDF and Word conversion settings listed are not used, and changes to these conversion settings are ignored.
The pdf conversion object defines the conversion from PDF to HTML and has the following structure:
"pdf": {
"heading": {
"fonts": [
{
"level": 1,
"min_size": 24,
"max_size": 80,
"bold": false,
"italic": true,
"name": "arial"
},
{
"level": 2,
"min_size": 18,
"max_size": 24,
"bold": true,
"italic": false,
"name": "arial"
}
]
}
},
When converting PDF files, headings in those files can be identified and converted into an appropriate HTML h tag by identifying the size, font, and style of each heading level. Heading levels can be specified multiple times, if necessary, to correctly identify all relevant sections. HTML heading levels are important to identify if you plan to extract content using CSS selectors, or if you intend to split the document using document splitting.
The heading object contains the fonts array; each item in that array specifies a heading level, using the following parameters:
- "level": INT - required - the HTML h level that text identified with these parameters is converted into.
- "min_size": INT - optional - the smallest font size that is identified as this heading level.
- "max_size": INT - optional - the largest font size that is identified as this heading level.
- "bold": boolean - optional - When true, only bold fonts are identified as this heading level.
- "italic": boolean - optional - When true, only italic fonts are identified as this heading level.
- "name": "string" - optional - The name of the font that is identified as this heading level.
For an area of text to be identified as a heading, it must match all of the parameters defined in the specific array item. If the parameters are too flexible, more headings might be identified than you anticipated. Results are better if you strictly define each heading level, multiple times if necessary, so that all heading level variations are included without invalid matches.
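The all-parameters-must-match rule can be sketched as follows. This is an illustrative approximation, not Discovery's internal converter; the function and argument names (`match_heading_level`, `text_font`) are invented for the example, and reading `"bold": false` as "must not be bold" is an assumption.

```python
def match_heading_level(text_font, fonts):
    """Return the HTML heading level of the first fonts entry whose
    parameters all match this text's font, or None if nothing matches."""
    for rule in fonts:
        if "min_size" in rule and text_font["size"] < rule["min_size"]:
            continue
        if "max_size" in rule and text_font["size"] > rule["max_size"]:
            continue
        if "bold" in rule and text_font["bold"] != rule["bold"]:
            continue
        if "italic" in rule and text_font["italic"] != rule["italic"]:
            continue
        # font-name comparison shown case-insensitively (an assumption)
        if "name" in rule and text_font["name"].lower() != rule["name"].lower():
            continue
        return rule["level"]
    return None

fonts = [
    {"level": 1, "min_size": 24, "max_size": 80, "bold": False, "italic": True, "name": "arial"},
    {"level": 2, "min_size": 18, "max_size": 24, "bold": True, "italic": False, "name": "arial"},
]
print(match_heading_level({"size": 20, "bold": True, "italic": False, "name": "Arial"}, fonts))  # 2
```

Because the first matching entry wins, ordering more specific entries first avoids a loosely defined level claiming text that a stricter entry should match.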
Word
If you configure your collection using Smart Document Understanding, the PDF and Word conversion settings listed are not used, and changes to these conversion settings are ignored.
The word conversion object defines how to convert Microsoft Word documents into HTML and has the following structure:
"word": {
"heading": {
"fonts": [
{
"level": 1,
"min_size": 24,
"bold": true,
"italic": false,
"name": "arial"
},
{
"level": 2,
"min_size": 18,
"max_size": 23,
"bold": true,
"italic": false
}
],
"styles": [
{
"level": 1,
"names": [
"pullout heading",
"pulloutheading",
"header"
]
},
{
"level": 2,
"names": [
"subtitle"
]
}
]
}
},
The Microsoft Word conversion object works in a similar way to the PDF conversion object. However, two different arrays can be specified inside the heading object when extracting headings from Microsoft Word documents.
You can use either or both heading extraction arrays to extract heading level elements from your Microsoft Word documents.
Each item in the fonts array specifies a heading level by font characteristics, using the following parameters:
- "level": INT - required - the HTML h level that text identified with these parameters is converted into.
- "min_size": INT - optional - the smallest font size that is identified as this heading level.
- "max_size": INT - optional - the largest font size that is identified as this heading level.
- "bold": boolean - optional - When true, only bold fonts are identified as this heading level.
- "italic": boolean - optional - When true, only italic fonts are identified as this heading level.
- "name": "string" - optional - The name of the font that is identified as this heading level.
For an area of text to be identified as a heading by the fonts array, it must match all of the parameters defined in the specific array item. If the parameters are too flexible, more headings might be identified than you anticipated. Results are better if you strictly define each heading level, multiple times if necessary, so that all heading level variations are included without invalid matches.
Each item in the styles array specifies a heading level from the Microsoft Word styles that are applied to that paragraph.
- "level": INT - required - the HTML h level that text identified with these parameters is converted into.
- "names": array - required - a comma-separated array of style names that are identified as this heading level.
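The styles lookup amounts to a mapping from Word paragraph style names to heading levels. The sketch below is illustrative only; `heading_level_for_style` is an invented name, and the case-insensitive comparison is an assumption.

```python
def heading_level_for_style(style_name, styles):
    """Return the HTML heading level whose "names" list contains this
    Word paragraph style, or None if no entry lists it."""
    for rule in styles:
        # style names are compared case-insensitively here (an assumption)
        if style_name.lower() in (name.lower() for name in rule["names"]):
            return rule["level"]
    return None

styles = [
    {"level": 1, "names": ["pullout heading", "pulloutheading", "header"]},
    {"level": 2, "names": ["subtitle"]},
]
print(heading_level_for_style("Subtitle", styles))  # 2
```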
HTML
The html conversion object controls how HTML content is cleaned up before it is converted to JSON and has the following structure:
"html": {
"exclude_tags_completely": [
"script",
"sup"
],
"exclude_tags_keep_content": [
"font",
"em"
],
"exclude_content": {
"xpaths": [
"//*[@id='list-old']",
"//*[@id='unstable']"
]
},
"keep_content": {
"xpaths": [
"//*[@id='footer']",
"//*[@id='header']"
]
},
"exclude_tag_attributes": [
"EVENT_ACTIONS"
],
"keep_tag_attributes": [
"id"
],
"extracted_fields": {
"{field_name}": {
"css_selector": "{CSS_selector_expression_1}",
"type": "{field_type}"
}
}
},
exclude_tags_completely
- "exclude_tags_completely" : array - An array of HTML tag names that are excluded entirely. This includes the tag, its content, and any tag attributes that are defined.
exclude_tags_keep_content
- "exclude_tags_keep_content" : array - An array of HTML tag names that are removed, while the content inside those tags is kept.
exclude_content
- "xpaths" : array - An array of XPaths that identify content that is removed. If this value is set, anything that matches one of the XPaths is removed from the output.
keep_content
- "xpaths" : array - An array of XPaths that identify content that is converted. If this value is set, anything that matches one of the XPaths is included in the output. The inclusions specified by this parameter are processed after any processing specified by exclude_content.
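To see the effect of an exclude_content XPath, the following sketch removes matching elements from a well-formed HTML fragment. It is not Discovery's converter: it uses Python's xml.etree.ElementTree, which supports only a small XPath subset and requires well-formed markup, whereas Discovery accepts richer XPath expressions over real-world HTML.

```python
import xml.etree.ElementTree as ET

def exclude_content(html, xpaths):
    """Drop every element matching one of the XPaths and return the
    remaining markup (sketch only; assumes matches are not the root)."""
    root = ET.fromstring(html)
    # ElementTree has no parent pointers, so build a child-to-parent map
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for xpath in xpaths:
        # rewrite the absolute "//" prefix into ElementTree's ".//" form
        for match in root.findall(xpath.replace("//", ".//", 1)):
            parent_of[match].remove(match)
    return ET.tostring(root, encoding="unicode")

html = "<html><body><div id='unstable'>beta</div><p>kept</p></body></html>"
print(exclude_content(html, ["//*[@id='unstable']"]))
# <html><body><p>kept</p></body></html>
```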
exclude_tag_attributes
- "exclude_tag_attributes" : array - An array of HTML attribute names that are removed by the conversion, regardless of which HTML tag they are present in.
You receive an error message if you specify both exclude_tag_attributes and keep_tag_attributes in the same configuration - only one can be specified per configuration. If present, keep_tag_attributes must be completely removed from the configuration; it cannot be present as an empty array.
keep_tag_attributes
- "keep_tag_attributes" : array - An array of HTML attribute names that are retained by the conversion.
You receive an error message if you specify both keep_tag_attributes and exclude_tag_attributes in the same configuration - only one can be specified per configuration. If present, exclude_tag_attributes must be completely removed from the configuration; it cannot be present as an empty array.
extracted_fields
This object defines any content from the HTML source that is to be extracted into a separate JSON field as part of the conversion. The content is identified by using CSS selectors.
Each field that you want to create is defined by an object as follows:
"{field_name}": {
"css_selector": "{CSS_selector_expression_1}",
"type": "{field_type}"
}
- "{field_name}" - The name of the field to be created. Field names defined in your configuration must meet the restrictions defined in Field Name Requirements.
- "css_selector" : string - required - a CSS selector expression that defines the area of content to be stored in the field.
- "type" : string - required - The type of field to be created; can be string or date.
For detailed information, see Using CSS selectors to extract fields.
Segment
The segment object is a set of configuration options that split ingested documents into one or more segments based on the identified HTML headings (h1, h2, and so on).
"segment": {
"enabled": true,
"selector_tags": ["h1", "h2", "h3", "h4", "h5", "h6"]
}
- "enabled": boolean - required - must be set to true to enable document segmentation.
- "selector_tags": array - required - a comma-separated array of HTML h tags on which to split documents.
As an overview, when document segmentation is enabled the following cannot be specified:
- json_normalizations cannot be specified as part of the configuration.
- normalizations cannot be specified as part of the configuration.
- The extracted_fields option of the html conversion cannot be specified as part of the configuration.
For detailed information, see Performing segmentation.
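A minimal way to picture heading-based segmentation is to split the HTML immediately before each configured heading tag. This regex sketch is an approximation of the behavior described above, not the actual segmentation code, and it ignores nesting and attributes beyond the tag name.

```python
import re

def segment(html, selector_tags=("h1", "h2")):
    """Split an HTML string into segments, each starting at one of the
    configured heading tags (a simplified, regex-based illustration)."""
    # zero-width lookahead so each heading tag starts its own segment
    pattern = "(?=<(?:%s)[ >])" % "|".join(selector_tags)
    return [part for part in re.split(pattern, html) if part.strip()]

html = "<h1>A</h1><p>one</p><h2>B</h2><p>two</p>"
print(segment(html))  # ['<h1>A</h1><p>one</p>', '<h2>B</h2><p>two</p>']
```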
JSON
You can perform pre-enrichment normalization of the ingested JSON by defining operation objects in the json_normalizations array.
"json_normalizations": [
{
"operation": "remove",
"source_field": "header"
},
{
"operation": "copy",
"source_field": "title",
"destination_field": "title_old"
},
{
"operation": "move",
"destination_field": "content",
"source_field": "body"
},
{
"operation": "merge",
"source_field": "synopsis",
"destination_field": "preamble"
},
{
"operation": "remove_nulls"
}
]
Operations objects
- "operation": string - required - the operation that is performed on the JSON; must be one of the following:
  - remove - the specified source_field is removed from the JSON.
  - copy - the content of the specified source_field is copied to a new instance of the destination_field.
  - move - the specified source_field is renamed to the destination_field. If the destination_field already exists, a new instance of the destination_field is created.
  - merge - the contents of the source_field and the destination_field are merged into the destination_field.
  - remove_nulls - fields with null content are removed.
- "source_field": string - optional - the field that the operation is performed on.
- "destination_field": string - optional - the destination field that the operation's output is written to.
Field names defined in your configuration must meet the restrictions defined in Field Name Requirements.
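The operations above can be illustrated on a flat document. This sketch is not Discovery's implementation: real documents can contain nested fields, which this illustration does not handle, and the string-concatenation merge strategy shown is an assumption.

```python
def apply_operations(doc, operations):
    """Apply json_normalizations-style operation objects to a flat dict."""
    for op in operations:
        kind = op["operation"]
        src, dst = op.get("source_field"), op.get("destination_field")
        if kind == "remove":
            doc.pop(src, None)                      # drop the source field
        elif kind == "copy" and src in doc:
            doc[dst] = doc[src]                     # duplicate into destination
        elif kind == "move" and src in doc:
            doc[dst] = doc.pop(src)                 # rename source to destination
        elif kind == "merge" and src in doc:
            # merge strategy (space-joined strings) is assumed for illustration
            doc[dst] = "%s %s" % (doc.get(dst, ""), doc.pop(src))
        elif kind == "remove_nulls":
            for key in [k for k, v in doc.items() if v is None]:
                del doc[key]
    return doc

doc = {"header": "x", "title": "T", "body": "B", "note": None}
ops = [
    {"operation": "remove", "source_field": "header"},
    {"operation": "copy", "source_field": "title", "destination_field": "title_old"},
    {"operation": "move", "source_field": "body", "destination_field": "content"},
    {"operation": "remove_nulls"},
]
print(apply_operations(doc, ops))  # {'title': 'T', 'title_old': 'T', 'content': 'B'}
```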
Enrichments
"enrichments": [
{
"enrichment": "elements",
"source_field": "html",
"destination_field": "enriched_html",
"options": {
"model": "contract"
}
},
{
"enrichment": "natural_language_understanding",
"source_field": "title",
"destination_field": "enriched_title",
"options": {
"features": {
"keywords": {
"sentiment": true,
"emotion": false,
"limit": 50
},
"entities": {
"sentiment": true,
"emotion": false,
"limit": 50,
"mentions": true,
"mention_types": true,
"sentence_locations": true,
"model": "WKS-model-id"
},
"sentiment": {
"document": true,
"targets": [
"IBM",
"Watson"
]
},
"emotion": {
"document": true,
"targets": [
"IBM",
"Watson"
]
},
"categories": {},
"concepts": {
"limit": 8
},
"semantic_roles": {
"entities": true,
"keywords": true,
"limit": 50
},
"relations": {
"model": "WKS-model-id"
}
}
}
}
]
Discovery supports adding Natural Language Understanding and Element Classification enrichments. Each field that you want to enrich is defined by an object in the enrichments array. Each enrichment object requires a source_field, a destination_field, and the enrichments to be applied.
The Element Classification enrichment is deprecated and will no longer be available, effective 10 July 2020.
- "enrichment" : string - required - The type of enrichment to use on this field. To extract Natural Language Understanding enrichments, use natural_language_understanding; to perform Element Classification, use elements. When you use the elements enrichment, it is important to follow the guidelines specified in the Element Classification documentation. Specifically, only PDF files can be ingested when this enrichment is specified.
- "source_field" : string - required - The source field to be enriched. This field must exist in your source after the json_normalizations operations complete.
- "destination_field" : string - required - The name of the container object where enrichments are created. Field names defined in your configuration must meet the restrictions defined in Field Name Requirements.
Element Classification enrichments
The Element Classification enrichment is deprecated and will no longer be available, effective 10 July 2020.
When you use Element Classification, each elements enrichment object must contain an "options": {} object with the following parameter specified:
- "model" : string - required - The element extraction model to be used on this document. The only currently supported model is contract.
When you use the elements enrichment, it is important to follow the guidelines specified in Element Classification. Specifically, only PDF files can be ingested when this enrichment is specified.
Natural Language Understanding enrichments
When you use Natural Language Understanding, each object within the enrichments array must also contain an "options": { "features": { } } object that contains one or more of the following enrichments:
categories
The categories enrichment identifies any general categories in the ingested document. This enrichment has no options and must be specified as an empty object: "categories" : {}.
concepts
The concepts enrichment finds concepts with which the input text is associated, based on other concepts and entities that are present in that text.
- "limit" : INT - required - The maximum number of concepts to extract from the ingested document.
emotion
The emotion enrichment evaluates the overall emotional tone (for example, anger) of the entire document or of specified target strings within the document. This enrichment can be used only with English content.
- "document" : boolean - optional - When true, the emotional tone of the entire document is evaluated.
- "targets" : array - optional - A comma-separated array of target strings whose emotional tone is evaluated within the document.
entities
The entities enrichment extracts instances of known entities such as people, places, and organizations. Optionally, a Knowledge Studio custom model can be specified to extract custom entities.
- "sentiment" : boolean - optional - When true, sentiment analysis is performed on the extracted entity in the context of the surrounding content.
- "emotion" : boolean - optional - When true, emotional tone analysis is performed on the extracted entity in the context of the surrounding content.
- "limit" : INT - optional - The maximum number of entities to extract from the ingested document. The default is 50.
- "mentions": boolean - optional - When true, the number of times that this entity is mentioned is recorded. The default is false.
- "mention_types": boolean - optional - When true, the mention type for each mention of this entity is stored. The default is false.
- "sentence_locations": boolean - optional - When true, the sentence location of each entity mention is stored. The default is false.
- "model" : string - optional - When specified, the custom model is used to extract entities instead of the public model. This option requires a Knowledge Studio custom model to be associated with your instance of Discovery. See Integrating with Watson Knowledge Studio for more information.
keywords
The keywords enrichment extracts instances of significant words within the text. To understand the difference between keywords, concepts, and entities, see Understanding the difference between Entities, Concepts, and Keywords.
- "sentiment" : boolean - optional - When true, sentiment analysis is performed on the extracted keyword in the context of the surrounding content.
- "emotion" : boolean - optional - When true, emotional tone analysis is performed on the extracted keyword in the context of the surrounding content.
- "limit" : INT - optional - The maximum number of keywords to extract from the ingested document. The default is 50.
semantic_roles
The semantic_roles enrichment identifies sentence components such as subject, action, and object within the ingested text.
- "entities" : boolean - optional - When true, entities are extracted from the sentence components.
- "keywords" : boolean - optional - When true, keywords are extracted from the sentence components.
- "limit" : INT - optional - The maximum number of semantic_roles objects to extract (sentences to parse) from the ingested document. The default is 50.
sentiment
The sentiment enrichment evaluates the overall sentiment level of the entire document or of specified target strings within the document.
- "document" : boolean - optional - When true, the sentiment of the entire document is evaluated.
- "targets" : array - optional - A comma-separated array of target strings whose sentiment is evaluated within the document.
relations
The relations enrichment extracts known relationships between identified entities within the document. Optionally, a Knowledge Studio custom model can be specified to extract custom relationships.
- "model" : string - optional - When specified, the custom model is used to extract relations instead of the public model. This option requires a Knowledge Studio custom model to be associated with your instance of Discovery. See Integrating with Watson Knowledge Studio for more information.
Normalization
The normalizations array is an array of JSON operation objects that are used to clean the ingested JSON after the enrichments are applied and before it is stored.
"normalizations": [
{
"operation": "remove",
"source_field": "enriched_title.entities.text"
},
{
"operation": "copy",
"source_field": "enriched_title.sentiment.document.score",
"destination_field": "titlescore"
}
]
The operation object options are the same as those described in Operations objects.
Field name requirements
Field names cannot contain spaces. The following characters and strings are reserved and cannot be used in field names:
. , # ? :
id
score
highlight
result_metadata
The characters _, +, and - cannot be used as the first character of a field name.
Do not append the numerical characters 0 - 9 to a field name, for example extracted-content2. These field names are indexed but cannot be queried.
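Taken together, these restrictions can be checked with a small validator. This sketch is illustrative, not an official check: it treats a trailing digit as invalid outright, although Discovery only makes such fields unqueryable rather than rejecting them.

```python
RESERVED_NAMES = {"id", "score", "highlight", "result_metadata"}
RESERVED_CHARS = {".", ",", "#", "?", ":"}

def is_valid_field_name(name):
    """Check a field name against the restrictions listed above."""
    if " " in name or name in RESERVED_NAMES:
        return False                       # no spaces, no reserved strings
    if any(c in name for c in RESERVED_CHARS):
        return False                       # no reserved characters anywhere
    if name[:1] in {"_", "+", "-"}:
        return False                       # no _, +, or - as a prefix
    if name[-1:].isdigit():
        return False                       # trailing digits make fields unqueryable
    return True

print(is_valid_field_name("extracted-content2"))  # False
print(is_valid_field_name("extracted_content"))   # True
```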