Understand tables

Apply the Table Understanding enrichment to get detailed information about tables and table-related data within documents.

The following tasks generate an HTML field with table information and apply the Table Understanding enrichment to it for your collection automatically:

  • If you use the Smart Document Understanding tool to define a user-trained or pretrained SDU model, the Table Understanding enrichment is applied to the html field that is generated for the collection.

  • If you create a Document Retrieval for Contracts project type, a pretrained SDU model is applied to your collection automatically. As a result, the Table Understanding enrichment is applied to the html field that is generated for the collection.

    For more information, see Smart Document Understanding.

Before you begin

The documents in your collection must contain a field with HTML representations of your tables. This information is often stored in the html field. If your collection consists of CSV or JSON files, a field other than the html field might contain the table information in HTML format.

Applying the Table Understanding enrichment

You can apply the enrichment only to a field that contains an HTML representation of the table.

To apply the enrichment, complete the following steps:

  1. From the navigation pane, open the Manage collections page, and then click a collection to open it.

  2. Click the Enrichments tab.

  3. Find the Table Understanding enrichment.

  4. From the field list, select the field that contains HTML representations of your tables. This field is often the html field.

After the enrichment is applied, you can get valid results when you submit queries that require Discovery to find information that is stored in tables.

A developer can query tables by using the API. For more information, see Query parameters.

For more information about how to apply the Table Understanding enrichment by using the API, see Applying enrichments by using the API.
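
As a rough illustration of querying tables through the API, the following sketch builds a query request body only and sends nothing. It assumes that the query API accepts a table_results object with enabled and count properties. Confirm the exact parameter names in Query parameters before you rely on them. The build_table_query function name is hypothetical.

```python
import json
from typing import Dict


def build_table_query(natural_language_query: str, table_count: int = 5) -> Dict:
    """Build a query request body that asks for table results.

    Assumption: the service accepts a `table_results` object with
    `enabled` and `count` properties. Check Query parameters to confirm.
    """
    return {
        "natural_language_query": natural_language_query,
        "table_results": {"enabled": True, "count": table_count},
    }


body = build_table_query("What was the statutory tax rate in 2005?")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized and sent with whichever HTTP client or SDK you already use.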

Working with tabular data in Python

Use Text Extensions for Pandas, an open-source library from IBM, to read the tables that were parsed from documents in Discovery into pandas DataFrame objects. A pandas DataFrame is an object that represents two-dimensional tabular data in a form that can be transformed and manipulated for downstream analysis in Python.

For example, you can extract content from tables in many annual report documents and reconstruct it into a single table that includes multiyear data points of interest. For more information, read the Structured Information Extraction from Tables in PDF Documents with Pandas and IBM Watson blog post on Medium.com.
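
Independent of that library, the idea of pivoting enriched table cells into two-dimensional data can be sketched with the Python standard library alone. The following minimal sketch is not the Text Extensions for Pandas API. The body_cells_to_records function name is hypothetical, and the sample input is abbreviated to the fields that the function reads. The result is a list of dictionaries, which is a shape that the pandas DataFrame constructor accepts.

```python
from collections import defaultdict
from typing import Dict, List


def body_cells_to_records(table: Dict) -> List[Dict[str, str]]:
    """Pivot the body cells of one table into one record per table row."""
    rows = defaultdict(dict)
    for cell in table.get("body_cells", []):
        # Join hierarchical headers in the order that the service returns them.
        row_label = " / ".join(cell.get("row_header_texts", []))
        column_label = " / ".join(cell.get("column_header_texts", []))
        rows[(cell["row_index_begin"], row_label)][column_label] = cell["text"]
    # Emit the records in table row order.
    return [
        {"row": row_label, **values}
        for (_, row_label), values in sorted(rows.items(), key=lambda item: item[0])
    ]


# Abbreviated sample input: only the fields that the function reads.
sample_table = {
    "body_cells": [
        {
            "text": "35.0%",
            "row_index_begin": 2,
            "row_header_texts": ["Statutory tax rate"],
            "column_header_texts": ["Three months ended September 30,", "2005"],
        },
        {
            "text": "35.0%",
            "row_index_begin": 2,
            "row_header_texts": ["Statutory tax rate"],
            "column_header_texts": ["Three months ended September 30,", "2004"],
        },
    ]
}

records = body_cells_to_records(sample_table)
print(records)
```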

Output schema

The output schema from the Table Understanding enrichment is as follows.

{
  "tables": [
    {
      "location" : {
        "begin" : int,
        "end" : int
      },
      "text": string,
      "section_title": {
        "text": string,
        "location": {
          "begin" : int,
          "end" : int
        }
      },
      "title": {
        "location": {
          "begin": int,
          "end": int
        },
        "text": string
      },
      "table_headers" : [
        {
          "cell_id" : string,
          "location" : {
            "begin" : int,
            "end" : int
          },
          "text" : string,
          "row_index_begin" : int,
          "row_index_end" : int,
          "column_index_begin" : int,
          "column_index_end" : int
        },
        ...
      ],
      "column_headers" : [
        {
          "cell_id" : string,
          "location" : {
            "begin" : int,
            "end" : int
          },
          "text" : string,
          "text_normalized" : string,
          "row_index_begin" : int,
          "row_index_end" : int,
          "column_index_begin" : int,
          "column_index_end" : int
        },
        ...
      ],
      "row_headers" : [
        {
          "cell_id" : string,
          "location" : {
            "begin" : int,
            "end" : int
          },
          "text" : string,
          "text_normalized" : string,
          "row_index_begin" : int,
          "row_index_end" : int,
          "column_index_begin" : int,
          "column_index_end" : int
        },
        ...
      ],
      "body_cells" : [
        {
          "cell_id" : string,
          "location" : {
            "begin" : int,
            "end" : int
          },
          "text" : string,
          "row_index_begin" : int,
          "row_index_end" : int,
          "column_index_begin" : int,
          "column_index_end" : int,
          "row_header_ids": [ string ],
          "row_header_texts": [ string ],
          "row_header_texts_normalized": [ string ],
          "column_header_ids": [ string ],
          "column_header_texts": [ string ],
          "column_header_texts_normalized": [ string ],
          "attributes" : [
             {
               "type" : string,
               "text" : string,
               "location" : {
                 "begin" : int,
                 "end" : int
               }
             },
             ...
           ]
        },
        ...
      ],
      "key_value_pairs": [
        {
          "key": {
            "cell_id": string,
            "location": {
              "begin": int,
              "end": int
            },
            "text": string
          },
          "value": [{
            "cell_id": string,
            "location": {
              "begin": int,
              "end": int
            },
            "text": string
          },
          ...
          ]
        },
        ...
      ],
      "contexts": [
        {
          "text": string,
          "location": {
            "begin": int,
            "end": int
          }
        },
        ...
      ]
    }
  ]
}
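
If you process this output in Python, one way to make the cell structure explicit is to mirror it with typed dictionaries. The following sketch covers only the cell-level objects. The class names are illustrative and are not part of any SDK. The field names are taken from the schema.

```python
from typing import List, TypedDict


class Location(TypedDict):
    begin: int
    end: int


class Cell(TypedDict):
    """Fields shared by table headers, column headers, row headers, and body cells."""
    cell_id: str
    location: Location
    text: str
    row_index_begin: int
    row_index_end: int
    column_index_begin: int
    column_index_end: int


class HeaderCell(Cell):
    """A column or row header, which also carries normalized text."""
    text_normalized: str


class Attribute(TypedDict):
    type: str
    text: str
    location: Location


class BodyCell(Cell):
    row_header_ids: List[str]
    row_header_texts: List[str]
    row_header_texts_normalized: List[str]
    column_header_ids: List[str]
    column_header_texts: List[str]
    column_header_texts_normalized: List[str]
    attributes: List[Attribute]


# A typed dictionary is a plain dict at run time.
header = HeaderCell(
    cell_id="colHeader-1544-1548",
    location=Location(begin=1544, end=1549),
    text="2005",
    text_normalized="Year 1",
    row_index_begin=1,
    row_index_end=1,
    column_index_begin=1,
    column_index_end=1,
)
print(header["text"], header["text_normalized"])
```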

Schema arrangement

The schema is arranged as follows.

  • tables: An array that defines the tables that are identified in the input document.

    • location: The location of the current table as defined by its begin and end indexes in the input document.

    • text: The textual contents of the current table from the input document without associated markup content.

    • section_title: If identified, the location of a section title contained in the current table. Empty if no section title is identified.

      • text: The text of the identified section title.
      • location: The location of the section title in the input document as defined by its begin and end indexes.
    • title: If identified, the title or caption of the current table of the form Table x.: .... Empty when no title is identified. When present, the title is excluded from the contexts array of the same table.

      • location: The location of the title in the input document as defined by its begin and end indexes.
      • text: The text of the identified table title or caption.
    • table_headers: An array of table-level cells applicable as headers to all the other cells of the current table. Each table header is defined as a collection of the following elements:

      • cell_id: The unique ID of the cell in the current table.
      • location: The location of the cell in the input document as defined by its begin and end indexes.
      • text: The textual contents of the cell from the input document without associated markup content.
      • row_index_begin: The begin index of the cell's row location in the current table.
      • row_index_end: The end index of the cell's row location in the current table.
      • column_index_begin: The begin index of the cell's column location in the current table.
      • column_index_end: The end index of the cell's column location in the current table.
    • column_headers: An array of column-level cells, each applicable as a header to other cells in the same column as itself, of the current table. Each column header is defined as a collection of the following items:

      • cell_id: The unique ID of the cell in the current table.
      • location: The location of the cell in the input document as defined by its begin and end indexes.
      • text: The textual contents of the cell from the input document without associated markup content.
      • text_normalized: Normalized column header text.
      • row_index_begin: The begin index of the cell's row location in the current table.
      • row_index_end: The end index of the cell's row location in the current table.
      • column_index_begin: The begin index of the cell's column location in the current table.
      • column_index_end: The end index of the cell's column location in the current table.
    • row_headers: An array of row-level cells, each applicable as a header to other cells in the same row as itself, of the current table. Each row header is defined as a collection of the following items:

      • cell_id: The unique ID of the cell in the current table.
      • location: The location of the cell in the input document as defined by its begin and end indexes.
      • text: The textual contents of the cell from the input document without associated markup content.
      • text_normalized: Normalized row header text.
      • row_index_begin: The begin index of the cell's row location in the current table.
      • row_index_end: The end index of the cell's row location in the current table.
      • column_index_begin: The begin index of the cell's column location in the current table.
      • column_index_end: The end index of the cell's column location in the current table.
    • body_cells: An array of the cells in the current table that are not table headers, column headers, or row headers, along with their row and column header associations. Each body cell is defined as a collection of the following items:

      • cell_id: The unique ID of the cell in the current table.

      • location: The location of the cell in the input document as defined by its begin and end indexes.

      • text: The textual contents of the cell from the input document without associated markup content.

      • row_index_begin: The begin index of this cell's row location in the current table.

      • row_index_end: The end index of this cell's row location in the current table.

      • column_index_begin: The begin index of this cell's column location in the current table.

      • column_index_end: The end index of this cell's column location in the current table.

      • row_header_ids: An array of values, where each value is the cell ID value of a row header that is associated with this body cell.

      • row_header_texts: An array of values, where each value is the text from a row header for this body cell.

      • row_header_texts_normalized: An array of values, where each value is the normalized text from a row header for this body cell.

      • column_header_ids: An array of values, where each value is the cell ID value of a column header that is associated with this body cell.

      • column_header_texts: An array of values, where each value is the text from a column header for this body cell.

      • column_header_texts_normalized: An array of values, where each value is the normalized text from a column header for this body cell.

      • attributes: An array that identifies document attributes. Each object in the array consists of three elements:

        • type: The type of attribute. Possible values are Address, Currency, DateTime, Duration, Location, Number, Organization, Percentage, and Person.
        • text: The text that is associated with the attribute.
        • location: The location of the attribute as defined by its begin and end indexes.
    • key_value_pairs: An array that specifies any key-value pairs in tables in the input document. For more information, see Understanding key-and-value pairs.

      • key: An object that specifies a key for a key-value pair.

        • cell_id: The unique ID of the key in the table.
        • location: The location of the key cell in the input document as defined by its begin and end indexes.
        • text: The text content of the table cell without HTML markup.
      • value: An array that specifies the value or values for a key-value pair.

        • cell_id: The unique ID of the value in the table.
        • location: The location of the value cell in the input document as defined by its begin and end indexes.
        • text: The text content of the table cell without HTML markup.
    • contexts: A list of related material that precedes and follows the table, excluding its section title, which is provided in the section_title field. Related material includes related sentences; footnotes; and sentences from other parts of the document that refer to the table. The list is represented as an array. Each object in the array consists of the following elements:

      • text: The text contents of a related material from the input document, without HTML markup.
      • location: The location of the related material in the input document as defined by its begin and end indexes.
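
Because body cells refer to their headers by cell ID, a common first step is to index every cell by its cell_id and then resolve the row_header_ids and column_header_ids arrays. The following minimal sketch shows one way to do that. The function names are hypothetical, and the sample cells are abbreviated to the fields that the functions read.

```python
from typing import Dict, List


def index_cells_by_id(table: Dict) -> Dict[str, Dict]:
    """Map each cell_id in one table to its cell object."""
    index = {}
    for group in ("table_headers", "column_headers", "row_headers", "body_cells"):
        for cell in table.get(group, []):
            index[cell["cell_id"]] = cell
    return index


def headers_for(cell: Dict, index: Dict[str, Dict]) -> List[Dict]:
    """Return the row headers, then the column headers, of a body cell."""
    header_ids = cell.get("row_header_ids", []) + cell.get("column_header_ids", [])
    return [index[header_id] for header_id in header_ids if header_id in index]


# Abbreviated sample input: only the fields that the functions read.
table = {
    "column_headers": [
        {"cell_id": "colHeader-1050-1082", "text": "Three months ended September 30,"},
        {"cell_id": "colHeader-1544-1548", "text": "2005"},
    ],
    "row_headers": [
        {"cell_id": "rowHeader-2244-2262", "text": "Statutory tax rate"},
    ],
    "body_cells": [
        {
            "cell_id": "bodyCell-2450-2455",
            "text": "35.0%",
            "row_header_ids": ["rowHeader-2244-2262"],
            "column_header_ids": ["colHeader-1050-1082", "colHeader-1544-1548"],
        },
    ],
}

index = index_cells_by_id(table)
header_texts = [h["text"] for h in headers_for(table["body_cells"][0], index)]
print(header_texts)
```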

Notes on the table output schema

  • Row and column index values per cell are zero-based and so begin with 0.
  • Multiple values in arrays of row_header_ids and row_header_texts elements indicate a possible hierarchy of row headers.
  • Multiple values in arrays of column_header_ids and column_header_texts elements indicate a possible hierarchy of column headers.
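
These notes can be illustrated with a short sketch. It assumes that the end indexes are inclusive, which matches a header whose column_index_begin is 1 and column_index_end is 2 when it sits above the cells in columns 1 and 2. The function names and the " > " separator are illustrative choices.

```python
from typing import Dict, List, Tuple


def covered_positions(cell: Dict) -> List[Tuple[int, int]]:
    """List every zero-based (row, column) position that a cell spans.

    Assumption: row_index_end and column_index_end are inclusive.
    """
    rows = range(cell["row_index_begin"], cell["row_index_end"] + 1)
    columns = range(cell["column_index_begin"], cell["column_index_end"] + 1)
    return [(row, column) for row in rows for column in columns]


def header_path(header_texts: List[str]) -> str:
    """Join a possible header hierarchy into a single label."""
    return " > ".join(header_texts)


spanning_header = {
    "row_index_begin": 0,
    "row_index_end": 0,
    "column_index_begin": 1,
    "column_index_end": 2,
}
positions = covered_positions(spanning_header)
path = header_path(["Three months ended September 30,", "2005"])
print(positions)
print(path)
```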

Examples

The following table is an example table from an input document.

Example table
Figure 1. Table example

The table is composed as follows:

Table composition
Figure 2. Anatomy of the example table

The following formatting is used in the table:

  • Bold text indicates a column header
  • Italic text indicates a row header
  • Unstyled text indicates a body cell

The output from the service represents the example's first body cell (that is, the first cell in row 3 with a value of 35.0%) as follows:

{
  "tables": [ {
    "location": {
      "begin": 872,
      "end": 5879
    },
    "text": "...",
    "section_title": {
      "text": "",
      "location": {
        "begin": 0,
        "end": 0
      }
    },
    "table_headers" : [ ],
    "column_headers" : [ {
      "cell_id" : "colHeader-1050-1082",
      "location" : {
        "begin" : 1050,
        "end" : 1083
      },
      "text" : "Three months ended September 30,",
      "text_normalized" : "Three months ended September 30,",
      "row_index_begin" : 0,
      "row_index_end" : 0,
      "column_index_begin" : 1,
      "column_index_end" : 2
    }, {
      "cell_id" : "colHeader-1270-1301",
      "location" : {
        "begin" : 1270,
        "end" : 1302
      },
      "text" : "Nine months ended September 30,",
      "text_normalized" : "Nine months ended September 30,",
      "row_index_begin" : 0,
      "row_index_end" : 0,
      "column_index_begin" : 3,
      "column_index_end" : 4
    }, {
      "cell_id" : "colHeader-1544-1548",
      "location" : {
        "begin" : 1544,
        "end" : 1549
      },
      "text" : "2005",
      "text_normalized" : "Year 1",
      "row_index_begin" : 1,
      "row_index_end" : 1,
      "column_index_begin" : 1,
      "column_index_end" : 1
    }, {
      "cell_id" : "colHeader-1712-1716",
      "location" : {
        "begin" : 1712,
        "end" : 1717
      },
      "text" : "2004",
      "text_normalized" : "Year 2",
      "row_index_begin" : 1,
      "row_index_end" : 1,
      "column_index_begin" : 2,
      "column_index_end" : 2
    }, {
      "cell_id" : "colHeader-1889-1893",
      "location" : {
        "begin" : 1889,
        "end" : 1894
      },
      "text" : "2005",
      "text_normalized" : "Year 1",
      "row_index_begin" : 1,
      "row_index_end" : 1,
      "column_index_begin" : 3,
      "column_index_end" : 3
    }, {
      "cell_id" : "colHeader-2057-2061",
      "location" : {
        "begin" : 2057,
        "end" : 2062
      },
      "text" : "2004",
      "text_normalized" : "Year 2",
      "row_index_begin" : 1,
      "row_index_end" : 1,
      "column_index_begin" : 4,
      "column_index_end" : 4
    } ],
    "row_headers" : [ {
      "cell_id" : "rowHeader-2244-2262",
      "location" : {
        "begin" : 2244,
        "end" : 2263
      },
      "text" : "Statutory tax rate",
      "text_normalized" : "Statutory tax rate",
      "row_index_begin" : 2,
      "row_index_end" : 2,
      "column_index_begin" : 0,
      "column_index_end" : 0
    }, {
      "cell_id" : "rowHeader-3197-3217",
      "location" : {
        "begin" : 3197,
        "end" : 3218
      },
      "text" : "IRS audit settlement",
      "text_normalized" : "IRS audit settlement",
      "row_index_begin" : 3,
      "row_index_end" : 3,
      "column_index_begin" : 0,
      "column_index_end" : 0
    }, {
      "cell_id" : "rowHeader-4148-4176",
      "location" : {
        "begin" : 4148,
        "end" : 4177
      },
      "text" : "Dividends received deduction",
      "text_normalized" : "Dividends received deduction",
      "row_index_begin" : 4,
      "row_index_end" : 4,
      "column_index_begin" : 0,
      "column_index_end" : 0
    }, {
      "cell_id" : "rowHeader-5106-5130",
      "location" : {
        "begin" : 5106,
        "end" : 5131
      },
      "text" : "Total effective tax rate",
      "text_normalized" : "Total effective tax rate",
      "row_index_begin" : 5,
      "row_index_end" : 5,
      "column_index_begin" : 0,
      "column_index_end" : 0
    } ],
    "key_value_pairs" : [ ],
    "body_cells" : [ {
      "cell_id" : "bodyCell-2450-2455",
      "location" : {
        "begin" : 2450,
        "end" : 2456
      },
      "text" : "35.0%",
      "row_index_begin" : 2,
      "row_index_end" : 2,
      "column_index_begin" : 1,
      "column_index_end" : 1,
      "row_header_ids" : [ "rowHeader-2244-2262" ],
      "row_header_texts" : [ "Statutory tax rate" ],
      "row_header_texts_normalized" : [ "Statutory tax rate" ],
      "column_header_ids" : [ "colHeader-1050-1082", "colHeader-1544-1548" ],
      "column_header_texts" : [ "Three months ended September 30,", "2005" ],
      "column_header_texts_normalized" : [ "Three months ended September 30,", "Year 1" ],
      "attributes": [ ]
    }, {
      "cell_id" : "bodyCell-2633-2638",
      "location" : {
        "begin" : 2633,
        "end" : 2639
      },
      "text" : "35.0%",
      "row_index_begin" : 2,
      "row_index_end" : 2,
      "column_index_begin" : 2,
      "column_index_end" : 2,
      "row_header_ids" : [ "rowHeader-2244-2262" ],
      "row_header_texts" : [ "Statutory tax rate" ],
      "row_header_texts_normalized" : [ "Statutory tax rate" ],
      "column_header_ids" : [ "colHeader-1050-1082", "colHeader-1712-1716" ],
      "column_header_texts" : [ "Three months ended September 30,", "2004" ],
      "column_header_texts_normalized" : [ "Three months ended September 30,", "Year 2" ],
      "attributes": [ ]
    }, {
      "cell_id" : "bodyCell-2825-2830",
      "location" : {
        "begin" : 2825,
        "end" : 2831
      },
      "text" : "35.0%",
      "row_index_begin" : 2,
      "row_index_end" : 2,
      "column_index_begin" : 3,
      "column_index_end" : 3,
      "row_header_ids" : [ "rowHeader-2244-2262" ],
      "row_header_texts" : [ "Statutory tax rate" ],
      "row_header_texts_normalized" : [ "Statutory tax rate" ],
      "column_header_ids" : [ "colHeader-1270-1301", "colHeader-1889-1893" ],
      "column_header_texts" : [ "Nine months ended September 30,", "2005" ],
      "column_header_texts_normalized" : [ "Nine months ended September 30,", "Year 1" ],
      "attributes": [ ]
    }, {
      "cell_id" : "bodyCell-3008-3013",
      "location" : {
        "begin" : 3008,
        "end" : 3014
      },
      "text" : "35.0%",
      "row_index_begin" : 2,
      "row_index_end" : 2,
      "column_index_begin" : 4,
      "column_index_end" : 4,
      "row_header_ids" : [ "rowHeader-2244-2262" ],
      "row_header_texts" : [ "Statutory tax rate" ],
      "row_header_texts_normalized" : [ "Statutory tax rate" ],
      "column_header_ids" : [ "colHeader-1270-1301", "colHeader-2057-2061" ],
      "column_header_texts" : [ "Nine months ended September 30,", "2004" ],
      "column_header_texts_normalized" : [ "Nine months ended September 30,", "Year 2" ],
      "attributes": [ ]
    },
    ...
    ],
    "contexts" : [ ]
  } ]
}
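
To show how this output might be consumed, the following sketch finds body cells by the text of their row and column headers. The find_body_cells function name is hypothetical. The sample data repeats the four body cells from the example, abbreviated to the fields that the function reads.

```python
from typing import Dict, List


def find_body_cells(table: Dict, row_header: str, column_headers: List[str]) -> List[Dict]:
    """Return the body cells that match one row header and all given column headers."""
    matches = []
    for cell in table.get("body_cells", []):
        row_texts = cell.get("row_header_texts", [])
        column_texts = cell.get("column_header_texts", [])
        if row_header in row_texts and all(h in column_texts for h in column_headers):
            matches.append(cell)
    return matches


THREE_MONTHS = "Three months ended September 30,"
NINE_MONTHS = "Nine months ended September 30,"
ROW = ["Statutory tax rate"]

# The four body cells from the example, abbreviated.
table = {
    "body_cells": [
        {"cell_id": "bodyCell-2450-2455", "text": "35.0%",
         "row_header_texts": ROW, "column_header_texts": [THREE_MONTHS, "2005"]},
        {"cell_id": "bodyCell-2633-2638", "text": "35.0%",
         "row_header_texts": ROW, "column_header_texts": [THREE_MONTHS, "2004"]},
        {"cell_id": "bodyCell-2825-2830", "text": "35.0%",
         "row_header_texts": ROW, "column_header_texts": [NINE_MONTHS, "2005"]},
        {"cell_id": "bodyCell-3008-3013", "text": "35.0%",
         "row_header_texts": ROW, "column_header_texts": [NINE_MONTHS, "2004"]},
    ]
}

matches = find_body_cells(table, "Statutory tax rate", [NINE_MONTHS, "2004"])
for cell in matches:
    print(cell["cell_id"], cell["text"])
```

Matching on every header in the hierarchy matters here because the year headers 2005 and 2004 each appear under both period headers.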

Understanding key-and-value pairs

Tables sometimes contain key-and-value pairs that span multiple table cells. Table Understanding can detect the following types of tabular pairs.

  • Simple key-and-value pairs in adjacent cells, as in the following example table:

    Basic table
    Key          | Value
    Item number  | 123456789
    Date         | 1/1/2019
    Amount       | $1,000
  • Key-and-value pairs in the same cell, as in the following example table:

    Complex table
    Key-value pairs         | Key-value pairs
    Item number: 123456789  | Amount: $1000
    Date: 1/1/2019          | Address: 123 Anywhere Dr
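
Once pairs are detected, the key_value_pairs array can be flattened into a dictionary for easier lookup. The following minimal sketch uses a hypothetical function name. Its sample input mirrors the basic table and omits the cell_id and location fields for brevity. Each key maps to a list because the value element of a pair is an array.

```python
from typing import Dict, List


def key_value_pairs_to_dict(table: Dict) -> Dict[str, List[str]]:
    """Map the text of each key to the texts of its values."""
    result: Dict[str, List[str]] = {}
    for pair in table.get("key_value_pairs", []):
        key_text = pair["key"]["text"]
        value_texts = [value["text"] for value in pair.get("value", [])]
        # Merge the values if the same key text occurs more than once.
        result.setdefault(key_text, []).extend(value_texts)
    return result


# Sample input modeled on the basic table. The cell_id and location fields are omitted.
table = {
    "key_value_pairs": [
        {"key": {"text": "Item number"}, "value": [{"text": "123456789"}]},
        {"key": {"text": "Date"}, "value": [{"text": "1/1/2019"}]},
        {"key": {"text": "Amount"}, "value": [{"text": "$1,000"}]},
    ]
}

pairs = key_value_pairs_to_dict(table)
print(pairs)
```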