Split documents to make query results more succinct

Split your documents so that the search function can find more concise information to return in query results.

For more information about the benefits of splitting documents, read the blog post "Using IBM Watson Discovery’s New Document Segmentation Feature" on Medium.com.

You can split only documents to which a user-trained Smart Document Understanding model is applied.

When you split a document, the original document is broken into segments. Each segment contains a more uniform set of information. By splitting the content in your documents into segmented groups, you can enrich and index your data at a more granular level.

To control how your documents are split, you specify a field, such as subtitle or question, to use as the page break marker. The page break options are populated with fields that are created when you apply a user-trained Smart Document Understanding (SDU) model to the documents. For more information, see Using Smart Document Understanding. You cannot split documents with fields that are generated by a pretrained Smart Document Understanding model.

As a document is reprocessed, it is evaluated from start to end. Whenever the page break marker field occurs, the original document is split and a new segment is created. The splitting continues at each marker field until the original document is broken into multiple segments.
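The start-to-end evaluation can be pictured with a minimal sketch. It assumes a document is represented as an ordered list of (field name, text) pairs, which is a hypothetical simplification and not the service's internal format. The function name split_at_marker is also illustrative only.

```python
from typing import List, Tuple

# Hypothetical representation: one (field_name, text) pair per identified field,
# in the order in which the fields occur in the document.
Element = Tuple[str, str]


def split_at_marker(elements: List[Element], marker: str) -> List[List[Element]]:
    """Walk the document from start to end and begin a new segment
    each time the page break marker field occurs."""
    segments: List[List[Element]] = []
    current: List[Element] = []
    for field, text in elements:
        # A marker closes the segment collected so far (if any)
        # and becomes the first field of the next segment.
        if field == marker and current:
            segments.append(current)
            current = []
        current.append((field, text))
    if current:
        segments.append(current)
    return segments


doc = [
    ("title", "Product guide"),
    ("subtitle", "Installing"),
    ("text", "Run the installer."),
    ("subtitle", "Configuring"),
    ("text", "Edit the settings file."),
    ("footer", "Page 2"),
]
segments = split_at_marker(doc, "subtitle")
```

In this sketch, content that precedes the first marker forms its own segment, and each later segment starts with the marker field.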

Before you begin, decide which field to use as the page break marker.

You can use any of the fields that are indexed by default. To see your choices, check the Fields to index list. Fields that have a Type value are stored in the index.
The number of segments per document is limited to 1,000. After segment number 999 is created, any remaining document content is stored within segment 1,000.
Metadata from PDF and Microsoft Word documents and any custom metadata is extracted and included in the index with each segment.
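The segment limit can be illustrated with a short sketch. It assumes that segments are plain Python lists of content items and that the overflow is simply appended to the last allowed segment. The function name cap_segments and the max_segments parameter are hypothetical and exist only to make the rule concrete.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def cap_segments(segments: Sequence[List[T]], max_segments: int = 1000) -> List[List[T]]:
    """Keep the first (max_segments - 1) segments unchanged and store all
    remaining content in the final segment."""
    if max_segments < 1:
        raise ValueError("max_segments must be at least 1")
    if len(segments) <= max_segments:
        return [list(segment) for segment in segments]
    head = [list(segment) for segment in segments[: max_segments - 1]]
    # Everything from segment number max_segments onward is merged into one segment.
    tail = [item for segment in segments[max_segments - 1:] for item in segment]
    return head + [tail]
```

With the default limit, a document that would otherwise produce 1,200 segments ends up with 999 regular segments plus one final segment that holds the content of the last 201.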

Be careful with documents that contain repeating sections, such as a catalog that has a description and specifications section for each product entry. If you split the document at too granular a level, the subsections, such as a section with specification details, can be disassociated from the product to which it belongs.

To split the documents in a collection, complete the following steps:

Click Manage collections from the navigation panel, and then click to open a collection.
Open the Manage fields page.

A list of the identified fields is displayed.
From the Improve query results by splitting your documents section, click Split document.
Choose the field that you want to use as your page break marker from the Select field dropdown.

The list that you can choose from includes a subset of all the identified fields.
Click Apply changes and reprocess.

You can check the status of the splitting process from the Activity page.

The metadata field of each segment includes the ID of the parent document, stored as parent_document_id. Each resulting segment of the original document can contain different information. For example, if you split the document based on the subtitle field, the first segment might contain only a title field. The next segment might contain a subtitle and a text field. The third segment might contain a subtitle field, a text field, and a footer field.

Updating documents that were split

If a document that was split changes and you want to upload the document again, work with a developer to replace the document by using the API. A developer can use the Update a document method to replace the original parent document. For more information, see the API reference. To provide the {document_id} path variable that must be sent with the request, copy the contents of the parent_document_id field of one of the document's segments.
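As a rough sketch of that last step, the following helper reads parent_document_id from one segment and builds the request path. The path layout (/v2/projects/.../collections/.../documents/...) is an assumption about the v2 API and should be checked against the API reference. The function names and the shape of the segment dictionary are hypothetical.

```python
from urllib.parse import quote


def parent_id_from_segment(segment: dict) -> str:
    """Return the parent document ID recorded in a segment's metadata."""
    return segment["metadata"]["parent_document_id"]


def update_document_path(project_id: str, collection_id: str, segment: dict) -> str:
    """Build the path for replacing the parent document.

    Assumption: the path layout mirrors the v2 'Update a document' method;
    verify it in the API reference before use.
    """
    document_id = parent_id_from_segment(segment)
    return "/v2/projects/{}/collections/{}/documents/{}".format(
        quote(project_id, safe=""),
        quote(collection_id, safe=""),
        quote(document_id, safe=""),
    )
```

The important point is that the {document_id} value comes from the segment's parent_document_id field, not from the segment's own document ID.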

When you replace the original document, all of the segments are overwritten. The exception is when the updated version of the document has fewer total segments than the original. In that case, the surplus segments from the older version are not overwritten and remain in the index.

Deleting document segments from the index

You can delete documents in a collection from the Manage data page. To find all of the document segments that were generated from a single document, check for documents with the same metadata.parent_document_id field value. For more information, see Excluding content from query results.
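A minimal sketch of that check follows. It assumes each document is available as a dictionary with a document_id key and a nested metadata.parent_document_id value, for example as returned by a query. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_segments_by_parent(documents: Iterable[dict]) -> Dict[str, List[str]]:
    """Map each parent document ID to the IDs of the segments generated from it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for doc in documents:
        parent_id = doc.get("metadata", {}).get("parent_document_id")
        # Documents without a parent ID are not segments and are skipped.
        if parent_id is not None:
            groups[parent_id].append(doc["document_id"])
    return dict(groups)
```

Every list in the result holds the segments that share one metadata.parent_document_id value.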

IBM Cloud Pak for Data before the 4.6.5 release

The Manage data page is available in installed deployments starting with the 4.6.5 release. In earlier releases, a developer can delete document segments by using the API. For more information, see the delete document API.
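A developer's cleanup script might be shaped like the following sketch. It makes no API call itself: delete_document is a hypothetical callable that stands in for whatever client function sends the delete request, and the document dictionaries are assumed to carry document_id and metadata.parent_document_id values.

```python
from typing import Callable, Iterable, List


def delete_segments_of(
    parent_id: str,
    documents: Iterable[dict],
    delete_document: Callable[[str], None],
) -> List[str]:
    """Delete every segment whose metadata.parent_document_id matches parent_id.

    delete_document is a stand-in for the real API call and receives
    the document ID of each matching segment.
    """
    deleted: List[str] = []
    for doc in documents:
        if doc.get("metadata", {}).get("parent_document_id") == parent_id:
            delete_document(doc["document_id"])
            deleted.append(doc["document_id"])
    return deleted
```

Passing the delete call in as a parameter keeps the selection logic testable without a live service instance.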