Introduction

You can use a collection of Watson Data REST APIs associated with Watson Studio and Watson Knowledge Catalog to manage data-related assets and the people who need to use these assets.

Refine data

Use the sampling APIs to create representative subsets of the data on which to test and refine your data cleansing and shaping operations. To better understand the contents of your data, you can create profiles of your data assets that include a classification of the data and additional distribution information, which helps you assess data quality.

Catalog data

Use the catalog APIs to create catalogs to administer your assets, associate properties with those assets, and organize the users who use the assets. Assets can be notebooks or connections to files, database sources, or data assets from a connection.

Data policies

Use the data policy APIs to implement data policies and a business glossary that fits your organization, to control user access rights to assets, and to make it easier to find data.

Ingest streaming data

Use the streams flow APIs to set up continuous, unidirectional flows of massive volumes of moving data that you can analyze in real time.

API Endpoint

https://api.dataplatform.cloud.ibm.com

Creating an IAM bearer token

Before you can call a Watson Data API, you must first create an IAM bearer token. Each token is valid for one hour; after a token expires, you must create a new one to continue using the API. The recommended way to retrieve a token programmatically is to create an API key for your IBM Cloud identity and then use the IAM token API to exchange that key for a token.

You can create a token in IBM Cloud or by using the IBM Cloud command line interface (CLI).

To create a token in the IBM Cloud:

  1. Log in to IBM Cloud and select Manage > Security > Platform API Keys.
  2. Create an API key for your own personal identity, copy the key value, and save it in a secure place. After you leave the page, you will no longer be able to access this value.
  3. With your API key, set up Postman or another REST API tool and run the curl command shown in the Curl command with API key to retrieve token section below.
  4. Use the value of the access_token property for your Watson Data API calls. Set the access_token value as the authorization header parameter for requests to the Watson Data APIs. The format is Authorization: Bearer <access_token_value_here>. For example:
    Authorization: Bearer eyJraWQiOiIyMDE3MDgwOS0wMDowMDowMCIsImFsZyI6IlJTMjU2In0...

To create a token by using the IBM Cloud CLI:

  1. Install the IBM Cloud CLI, log in to IBM Cloud, and retrieve the token, as described in the IBM Cloud CLI documentation.

    Remove the Bearer prefix from the returned IAM token value before using the token in your API calls.

Curl command with API key to retrieve token

        curl "https://iam.ng.bluemix.net/identity/token" \
        -d "apikey=YOUR_API_KEY_HERE&grant_type=urn%3Aibm%3Aparams%3Aoauth%3Agrant-type%3Aapikey" \
        -H "Content-Type: application/x-www-form-urlencoded" \
        -H "Authorization: Basic Yng6Yng="

Response

        {
        "access_token": "eyJraWQiOiIyMDE3MDgwOS0wMDowMDowMCIsImFsZyI6...",
        "refresh_token": "zmRTQFKhASUdF76Av6IUzi9dtB7ip8F2XV5fNgoRQ0mbQgD5XCeWkQhjlJ1dZi8K...",
        "token_type": "Bearer",
        "expires_in": 3600,
        "expiration": 1505865282
        }
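
The same token exchange can be sketched in Python using only the standard library. The endpoint, form parameters, and Basic header mirror the curl example above; fetch_iam_token performs a live network call, while auth_header only formats the Authorization header from a token response:

```python
import json
import urllib.parse
import urllib.request

IAM_URL = "https://iam.ng.bluemix.net/identity/token"  # endpoint from the curl example

def fetch_iam_token(api_key):
    """Exchange an IBM Cloud API key for an IAM token (live network call)."""
    body = urllib.parse.urlencode({
        "apikey": api_key,
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
    }).encode("utf-8")
    request = urllib.request.Request(
        IAM_URL,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Authorization": "Basic Yng6Yng=",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def auth_header(token_response):
    """Format the Authorization header expected by the Watson Data APIs."""
    return {"Authorization": "Bearer " + token_response["access_token"]}
```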

Versioning

Watson Data API has a major, minor, and patch version, following industry conventions on semantic versioning. In the version number format MAJOR.MINOR.PATCH, the MAJOR version is incremented when incompatible API changes are made, the MINOR version is incremented when functionality is added in a backwards-compatible manner, and the PATCH version is incremented when backwards-compatible bug fixes are made. The service major version is represented in the URL path.

Sorting

Some of the Watson Data API collections provide custom sorting support. Custom sorting is implemented using the sort query parameter. Service collections can also support single-field or multi-field sorting. The sort parameter in collections that support single-field sorting can contain any one of the valid sort fields.

For example, the following expression sorts accounts on company name (ascending): GET /v2/accounts?sort=company_name

You can also prefix a field with a + or - character, indicating “ascending” or “descending,” respectively.

For example, the expression below sorts on the last name of the account owner, in descending order: GET /v2/accounts?sort=-owner.last_name

The sort parameter in collections that support sorting on multiple fields can contain a comma-separated sequence of fields (each, optionally, with a + or -) in the same format as the single-field sorting. Sorts are applied to the data set in the order that they are provided. For example, the expression below would sort accounts first on company name (ascending) and second on owner last name (descending): GET /v2/accounts?sort=company_name,-owner.last_name
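
The multi-field behavior above can be sketched in Python. parse_sort and apply_sort are illustrative helpers (not part of the API) showing one way a client could reproduce the ordering, including dotted field paths:

```python
def parse_sort(sort_param):
    """Split 'company_name,-owner.last_name' into (field, descending) pairs."""
    keys = []
    for part in sort_param.split(","):
        descending = part.startswith("-")
        keys.append((part.lstrip("+-"), descending))
    return keys

def apply_sort(rows, sort_param):
    """Sort a list of dicts by a sort expression; dotted names address nested fields."""
    def lookup(row, dotted):
        for name in dotted.split("."):
            row = row[name]
        return row
    # Apply keys last-to-first so earlier keys take precedence (Python sorts are stable).
    for field, descending in reversed(parse_sort(sort_param)):
        rows = sorted(rows, key=lambda r: lookup(r, field), reverse=descending)
    return rows
```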

Filtering

Some of the Watson Data API collections provide filtering support. You can specify one or more filters where each supported field is required to match a specific value for basic filtering. The query parameter names for a basic filter must exactly match the name of a primitive field on a resource in the collection or a nested primitive field where the '.' character is the hierarchical separator. The only exception to this rule is for primitive arrays. In primitive arrays, such as tags, a singular form of the field is supported as a filter that matches the resource if the array contains the supplied value. Some of the Watson Data API collections can also support extended filtering comparisons for the following field types: Integer and float, date and date/time, identifier and enumeration, and string.
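
The basic filtering rules above can be sketched as a small matcher. matches is a hypothetical client-side helper, not the service logic, showing how dotted names and singular array filters behave:

```python
def matches(resource, filters):
    """Return True if a resource satisfies every basic filter.

    Dotted names address nested primitive fields; a singular name
    (e.g. 'tag') matches membership in the plural array ('tags').
    """
    for name, expected in filters.items():
        # Singular form of a primitive array: 'tag=x' matches if 'tags' contains x.
        if name not in resource and name + "s" in resource:
            if expected not in resource[name + "s"]:
                return False
            continue
        value = resource
        for part in name.split("."):
            value = value[part]
        if str(value) != str(expected):
            return False
    return True
```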

Rate Limiting

The following rate limiting headers are supported by some of the Watson Data service APIs when rate limiting is active:

  1. X-RateLimit-Limit: the number of requests permitted per hour.

  2. X-RateLimit-Remaining: the number of requests remaining in the current rate limit window.

  3. X-RateLimit-Reset: the time at which the current rate limit window resets, as a UNIX timestamp.
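
A client can use these headers to back off before retrying. seconds_until_reset is an illustrative helper, assuming only the header names above:

```python
import time

def seconds_until_reset(headers, now=None):
    """Return how long to wait before retrying, based on the rate-limit headers.

    Returns 0 when requests remain in the window or rate limiting is
    inactive (headers absent).
    """
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is None or reset is None or int(remaining) > 0:
        return 0
    now = time.time() if now is None else now
    return max(0, int(reset) - now)
```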

Error Handling

Responses with 400-series or 500-series status codes are returned when a request cannot be completed. The body of these responses follows the error model, which contains a code field to identify the problem and a message field to explain how to solve the problem. Each individual endpoint has specific error messages. All responses with 500 or 503 status codes are logged and treated as a critical failure requiring an emergency fix.
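
A client-side sketch of handling the error model. WatsonDataAPIError and check_response are hypothetical names, assuming only the documented code and message fields:

```python
class WatsonDataAPIError(Exception):
    """Raised for 400- and 500-series responses (hypothetical client-side type)."""

    def __init__(self, status, code, message):
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message

def check_response(status, body):
    """Raise using the error model's code and message fields; pass the body through otherwise."""
    if 400 <= status < 600:
        raise WatsonDataAPIError(status, body.get("code"), body.get("message"))
    return body
```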

Connections

A connection is the information necessary to create a connection to a data source or a repository. You create a connection asset by providing the connection information.

List data source types

Data sources are where data can be written or read and might include relational database systems, file systems, object storage systems and others.

To list supported data source types, call the following GET method:

GET /v2/datasource_types

The response to the GET method includes information about each of the sources and targets that are currently supported. For each data source type, the response includes a unique ID (the metadata.asset_id property), a name, and a label. Use the metadata.asset_id property value for the data source in other APIs that reference a data source type. The response also includes other useful information, such as whether that data source can be used as a source, a target, or both.

Use the connection_properties=true query parameter to return a set of properties for each data source type that is used to define a connection to it. Use the interaction_properties=true query parameter to return a set of properties for each data source type that is used to interact with a created connection. Interaction properties for a relational database might include the table name and schema from which to retrieve data.

Use the _sort query parameter to order the list of data source types returned in the response.

A default maximum of 100 data source type entries are returned per page of results. Use the _limit query parameter with an integer value to specify a lower limit.

More data source types than those on the first page of results might be available. The response includes paging links based on the page size specified with _limit. Call a GET method using the value of the next.href property to retrieve the next page of results, the prev.href property to retrieve the previous page, or the last.href property to retrieve the last page.

These URIs use the _offset and _limit query parameters to retrieve a specific block of data source types from the full list. You can also supply your own combination of the _offset and _limit query parameters to retrieve a custom block of results.
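
Paging through the full list can be sketched as a generator that follows next.href. Here fetch stands in for any function that performs an authenticated GET and returns the parsed JSON page; the sketch stops when a page has no next link:

```python
def iter_pages(first_url, fetch):
    """Yield each page of a paged collection by following next.href.

    `fetch` is any callable that takes a URL and returns the parsed
    JSON page, for example an authenticated GET wrapper.
    """
    url = first_url
    while url:
        page = fetch(url)
        yield page
        url = page.get("next", {}).get("href")
```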

Create a connection

Connections to any of the supported data source types returned by the previous method can be created and persisted in a catalog or project.

To create a connection, call the following POST method:

POST /v2/connections

A new connection can be created in a catalog or project. Use the catalog_id or project_id query parameter to specify where to create the connection asset. Either catalog_id or project_id is required.

The request body for the method is a UTF-8 encoded JSON document and includes the data source type ID (obtained in the List data source types section), its unique name in the catalog or project space, and a set of connection properties specific to the data source. Some connection properties are required.

The following example shows the request body used for creating a connection to IBM dashDB:

{
     "datasource_type": "cfdcb449-1204-44ba-baa6-9a8a878e6aa7",
     "name":"My-DashDB-Connection",
     "properties": {
       "host":"dashDBhost.com",
       "port":"50001",
       "database":"MYDASHDB",
       "password": "mypassword",
       "username": "myusername"
     }
}

By default, the physical connection to the data source is tested when the connection is created. Use the test=false query parameter to disable the connection test.

A response payload containing a connection ID and other metadata is returned when a connection is successfully created. Use the connection ID as path parameter in other REST APIs when a connection resource must be referenced.
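
The POST request above can be assembled in Python. build_create_connection_request is an illustrative helper (not part of the API) that only builds the request object, using the dashDB payload shape shown earlier; send it with urllib.request.urlopen:

```python
import json
import urllib.request

SERVICE_URL = "https://api.dataplatform.cloud.ibm.com"

def build_create_connection_request(token, project_id, datasource_type,
                                    name, properties, test=True):
    """Build the POST /v2/connections request; send it with urllib.request.urlopen."""
    url = f"{SERVICE_URL}/v2/connections?project_id={project_id}"
    if not test:
        url += "&test=false"  # skip the physical connection test on creation
    payload = {
        "datasource_type": datasource_type,  # metadata.asset_id from /v2/datasource_types
        "name": name,
        "properties": properties,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),  # UTF-8 encoded JSON body
        headers={"Authorization": "Bearer " + token, "Content-Type": "application/json"},
        method="POST",
    )
```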

Discover connection assets

Data sources contain data and metadata describing the data they contain.

To discover or browse the data or metadata in a data source, call the following GET method:

GET /v2/connections/{connection_id}/assets?path=

Use the catalog_id or project_id query parameter to specify where the connection asset was created. Either catalog_id or project_id is required.

connection_id is the ID of the connection asset returned from the POST https://{service_URL}/v2/connections method, which created the connection asset.

The path query parameter is required and is used to specify the hierarchical path of the asset within the data source to be browsed. In a relational database, for example, the path might represent a schema and table. For a file object, the path might represent a folder hierarchy.

Each asset in the assets array returned by this method includes a property containing its path in the hierarchy to facilitate the next call to drill down deeper in the hierarchy.

For example, starting at the root path in an RDBMS will return a list of schemas:

{
    "path": "/",
    "asset_types": [
        {
            "type": "schema",
            "dataset": false,
            "dataset_container": true
        }
    ],
    "assets": [
        {
            "id": "GOSALES",
            "type": "schema",
            "name": "GOSALES",
            "path": "/GOSALES"
        }
    ],
    "fields": [],
    "first": {
        "href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=0&_limit=100"
    },
    "prev": {
        "href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=0&_limit=100"
    },
    "next": {
        "href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=100&_limit=100"
    }
}

Drill down into the GOSALES schema using the path property for the GOSALES schema asset to discover the list of table assets in the schema.

GET /v2/connections/{connection_id}/assets?catalog_id={catalog_id}&path=/GOSALES

The list of table type assets is returned in the response.

{
    "path": "/GOSALES",
    "asset_types": [
        {
            "type": "table",
            "dataset": true,
            "dataset_container": false
        }
    ],
    "assets": [
        {
            "id": "BRANCH",
            "type": "table",
            "name": "BRANCH",
            "description": "BRANCH contains address information for corporate offices and distribution centers.",
            "path": "/GOSALES/BRANCH"
        },
        {
            "id": "CONVERSION_RATE",
            "type": "table",
            "name": "CONVERSION_RATE",
            "description": "CONVERSION_RATE contains currency exchange values.",
            "path": "/GOSALES/CONVERSION_RATE"
        }
    ],
    "fields": [],
    "first": {
        "href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=0&_limit=100"
    },
    "prev": {
        "href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=0&_limit=100"
    },
    "next": {
        "href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=100&_limit=100"
    }
}

Use the fetch query parameter with a value of data, metadata, or both. Data can be fetched only from data set assets. In the response above, note that the table asset type has a dataset property value of true, which means that data can be fetched from table type assets. If you fetch assets from the connection root, however, the response contains schema asset types, which are not data sets, so fetching data from them is not applicable.

A default maximum of 100 metadata assets are returned per page of results. Use the _limit query parameter with an integer value to specify a lower limit. More assets than those on the first page of results might be available.

The response includes paging links based on the page size specified with _limit. Call a GET method using the value of the next.href property to retrieve the next page of results, the prev.href property to retrieve the previous page, or the last.href property to retrieve the last page.

These URIs use the _offset and _limit query parameters to retrieve a specific block of assets from the full list. Alternatively, use a combination of the _offset and _limit query parameters to retrieve a custom block of results.

Specify properties for reading delimited files

When reading a delimited file using this method, specify property values to correctly parse the file based on its format. These properties are passed to the method as a JSON object using the properties query parameter. The default file format (property file_format) is a CSV file. If the file is a CSV, the following property values are set by default:

Property Name      Property Description   Default Value              Value Description
quote_character    quote character        double_quote               double quotation mark
field_delimiter    field delimiter        comma                      comma
row_delimiter      row delimiter          carriage_return_linefeed   carriage return followed by line feed
escape_character   escape character       double_quote               double quotation mark

For CSV file formats, these property values cannot be overridden. If you need to modify these properties to properly read a delimited file, set the file_format property to delimited. For generic delimited files, these properties have the following default values:

Property Name      Property Description   Default Value   Value Description
quote_character    quote character        none            no character is used for a quote
field_delimiter    field delimiter        null            no field delimiter value is set by default
row_delimiter      row delimiter          new_line        any new line representation
escape_character   escape character       none            no character is used for an escape

This example sets file format properties for a generic delimited file:

GET https://{service_URL}/v2/connections/{connection_id}/assets?catalog_id={catalog_id}&path=/myFolder/myFile.txt&fetch=data&properties={"file_format":"delimited", "quote_character":"single_quote","field_delimiter":"colon","escape_character":"backslash"}
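
Because the properties value is a JSON object embedded in a query string, it must be URL-encoded. delimited_asset_url is an illustrative helper that builds the GET URL above:

```python
import json
import urllib.parse

def delimited_asset_url(service_url, connection_id, catalog_id, path):
    """Build the discovery URL with URL-encoded file-format properties
    for a generic delimited file (property values from the example above)."""
    properties = {
        "file_format": "delimited",
        "quote_character": "single_quote",
        "field_delimiter": "colon",
        "escape_character": "backslash",
    }
    query = urllib.parse.urlencode({
        "catalog_id": catalog_id,
        "path": path,
        "fetch": "data",
        # Compact JSON so the encoded query contains no '+'-encoded spaces.
        "properties": json.dumps(properties, separators=(",", ":")),
    })
    return f"{service_url}/v2/connections/{connection_id}/assets?{query}"
```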

For more information about this method see the REST API Reference.

Discover assets using a transient connection

A data source's assets can be discovered without creating a persistent connection.

To browse assets without first creating a persistent connection, call the following POST method:

POST https://{service_URL}/v2/connections/assets?path=

This method is identical in behavior to the GET method in the Discover connection assets section except for two differences:

  1. You define the connection properties in the request body of the REST API. You do not reference the connection ID of a persistent connection with a query parameter. The same JSON object used to create a persistent connection is used in the request body.
  2. You do not specify a catalog or project ID with a query parameter.

See the previous section to learn how to set properties used to read delimited files.

For more information about this method see the REST API Reference.

Update a connection

To modify the properties of a connection, call the following PATCH method:

PATCH /v2/connections/{connection_id}

connection_id is the ID of the connection asset returned from the POST https://{service_URL}/v2/connections method, which created the connection asset.

Use the catalog_id or project_id query parameter to specify where the connection asset was created. Either catalog_id or project_id is required.

Set the Content-Type header to application/json-patch+json. The request body contains the connection properties to update using a JSON object in JSON Patch format.

Change the port number of the connection and add a description using this JSON Patch:

[
    {
        "op": "add",
        "path": "/description",
        "value": "My new PATCHed description"
    },
    {
        "op":"replace",
        "path":"/properties/port",
        "value":"40001"
    }
]
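
To preview what a JSON Patch will do before sending it, the add, replace, and remove operations can be applied locally. apply_patch is a deliberately minimal sketch, not a full RFC 6902 implementation:

```python
def apply_patch(doc, ops):
    """Apply the add/replace/remove JSON Patch operations used above to a dict.

    Minimal preview helper only; array indices and the other RFC 6902
    operations (move, copy, test) are not handled.
    """
    for op in ops:
        parts = op["path"].strip("/").split("/")
        target = doc
        for part in parts[:-1]:
            target = target.setdefault(part, {})
        if op["op"] in ("add", "replace"):
            target[parts[-1]] = op["value"]
        elif op["op"] == "remove":
            target.pop(parts[-1], None)
    return doc
```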

By default, the physical connection to the data source is tested when the connection is modified. Use the test=false query parameter to disable the connection test.

For more information about this method see the REST API Reference.

Delete a connection

To delete a persistent connection, call the following DELETE method:

DELETE /v2/connections/{connection_id}

connection_id is the ID of the connection asset returned from the POST https://{service_URL}/v2/connections method, which created the connection asset.

Use the catalog_id or project_id query parameter to specify where the connection asset was created. Either catalog_id or project_id is required.

Schedules

Introduction

Schedules allow you to run a data flow, a notebook, a data profile, or any other given source more than once. Schedules support the repeat types hour, day, week, month, and year, with two repeat end options: an end date or a maximum number of runs.

Create a schedule

To create a schedule in a specified catalog or project, call the following POST method:

     HTTP Method : POST
     URI : /v2/schedules

Before you create a schedule, you must consider the following points:

  1. You must have a valid IAM token to make REST API calls and a project or catalog ID.

  2. You must be authorized (be assigned the correct role) to create schedules in the catalog or project.

  3. The start and end dates must be in the following format: YYYY-MM-DDTHH:mm:ssZ or YYYY-MM-DDTHH:mm:ss.sssZ (specified in RFC 3339).

  4. The supported repeat types are hour, day, week, month, and year.

  5. There are two repeat end options, namely max_invocations and end_date.

  6. The supported repeat interval is 1.

  7. There are three statuses for schedules: enabled, disabled, and finished. To create a schedule, the status must be enabled. The scheduling service updates the status to finished once the schedule has finished running. You can stop or pause the scheduling service by updating the status to disabled.

  8. You can update the endpoint URI in the target HREF. Supported target methods are POST, PUT, PATCH, DELETE, and GET.

  9. Set generate_iam_token=true. When this option is set to true, the scheduling service generates an IAM token and passes it to the target URL at runtime. This IAM token is required to run schedules automatically at the scheduled intervals. This token is not to be confused with the IAM token required to make Watson Data API REST calls.

This POST method creates a schedule in a catalog with a defined start and a given end date:

    {
    "catalog_id": "aeiou",
    "description": "aeiou",
    "name": "aeiou",
    "tags": ["aeiou"],
    "start_date": "2017-08-22T01:02:14.859Z",
    "status": "enabled",
    "repeat": {
        "repeat_interval": 1,
        "repeat_type": "hour"
    },
    "repeat_end": {
        "end_date": "2017-08-24T01:02:14.859Z"
    },
    "target": {
        "href": "https://api.dataplatform.cloud.ibm.com/v2/data_profiles?start=false",
        "generate_iam_token": true,
        "method": "POST",
        "payload": "aeiou",
        "headers": [
         {
            "name": "content-type",
            "value": "application/json",
            "sensitive": false
           }
        ]
      }
    }
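
The repeat settings above determine when the schedule runs. preview_runs is an illustrative helper (hour, day, and week only, with repeat_interval fixed at 1) that previews the run times a schedule implies:

```python
from datetime import datetime, timedelta

def preview_runs(start, repeat_type, end_date=None, max_invocations=None):
    """Preview the run times implied by a schedule's repeat settings.

    Covers the hour/day/week repeat types with repeat_interval 1;
    month and year are omitted to keep the sketch short.
    """
    step = {
        "hour": timedelta(hours=1),
        "day": timedelta(days=1),
        "week": timedelta(weeks=1),
    }[repeat_type]
    runs, current = [], start
    while (end_date is None or current <= end_date) and \
          (max_invocations is None or len(runs) < max_invocations):
        runs.append(current)
        current += step
        if end_date is None and max_invocations is None:
            break  # no repeat end option given: stop after one run in this sketch
    return runs
```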

Get multiple schedules in a catalog or project

To get all schedules in the specified catalog or project, call the following GET method:

 HTTP Method: GET
 URI :/v2/schedules

You need the following information to get multiple schedules:

  1. A valid IAM token, schedule ID, and the catalog or project ID.

  2. You must be authorized to get schedules in the catalog or project.

You can filter the returned results by using the options entity.schedule.name and entity.schedule.status, and you can filter matching types by using StartsWith (starts:) and Equals (e:).

You can sort the returned results either in ascending or descending order by using one or more of the following options: entity.schedule.name, metadata.create_time, and entity.schedule.status.
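
Building the filter query string can be sketched as follows. schedules_query is a hypothetical helper; the starts: and e: prefixes are the StartsWith and Equals match types described above:

```python
import urllib.parse

def schedules_query(project_id, name_starts_with=None, status_equals=None):
    """Build a GET /v2/schedules query string with optional filters.

    Illustrative helper only; 'starts:' and 'e:' are the StartsWith
    and Equals match-type prefixes.
    """
    params = {"project_id": project_id}
    if name_starts_with is not None:
        params["entity.schedule.name"] = "starts:" + name_starts_with
    if status_equals is not None:
        params["entity.schedule.status"] = "e:" + status_equals
    return "/v2/schedules?" + urllib.parse.urlencode(params)
```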

Get a schedule

To get a schedule in the specified catalog or project, call the following GET method:

     HTTP Method: GET
     URI :/v2/schedules/{schedule_id}

You need the following information to get a schedule:

  1. A valid IAM token, schedule ID, and the catalog or project ID.

  2. You must be authorized to get a schedule in the catalog or project.

Update a schedule

To update a schedule in the specified catalog or project, call the following PATCH method:

HTTP Method: PATCH
URI :/v2/schedules/{schedule_id}

You need the following information to update a schedule:

  1. A valid IAM token, schedule ID, and the catalog or project ID.

  2. You must be authorized to update a schedule in the catalog or project.

You can update all the attributes under entity but can't update the attributes under meta-data.

Patch supports the replace, add, and remove operations. The replace operation can be used with all the attributes under entity. The add and remove operations can only be used with the repeat end options, namely max_invocations and end_date.

The start and end dates must be in the following format: YYYY-MM-DDTHH:mm:ssZ or YYYY-MM-DDTHH:mm:ss.sssZ (specified in RFC 3339).

This PATCH method replaces the repeat type, removes the max invocations, and adds an end date (note that a remove operation takes no value):

    [
     {
     "op": "remove",
     "path": "/entity/schedule/repeat_end/max_invocations"
     },
     {
     "op": "add",
     "path": "/entity/schedule/repeat_end/end_date",
     "value": "date"
     },
     {
     "op": "replace",
     "path": "/entity/schedule/repeat/repeat_type",
     "value": "week"
     }
    ]

Delete a schedule

To delete a schedule in the specified catalog or project, call the following DELETE method:

    HTTP Method : DELETE
    URI :{GATEWAY_URL}/v2/schedules/{schedule_id}

{schedule_id} is the ID of the schedule to delete.

You need the following information to delete a schedule:

  1. A valid IAM token, schedule ID, and the catalog or project ID.

  2. You must be authorized to delete a schedule in the catalog or project.

Delete multiple schedules

To delete multiple schedules in the specified catalog or project, call the following DELETE method:

HTTP Method: DELETE
URI :{GATEWAY_URL}/v2/schedules

You need the following information to delete multiple schedules:

  1. A valid IAM token, schedule ID, and the catalog or project ID.

  2. You must be authorized to delete schedules in the catalog or project.

  3. A comma-separated list of the schedule IDs. If schedule IDs are not listed in the parameter schedule_ids, the scheduling service will delete all the schedules in the catalog or project.

Data Flows

Introduction

A data flow can read data from a large variety of sources, process that data using pre-defined operations or custom code, and then write it to one or more targets. The runtime engine can handle large amounts of data so it's ideally suited for reading, processing, and writing data at volume.

The sources and targets that are supported include both Cloud and on-premises offerings as well as data assets in projects. Cloud offerings include IBM Cloud Object Storage, Amazon S3, and Azure, among others. On-premises offerings include IBM Db2, Microsoft SQL Server, and Oracle, among others.

For a list of the supported connectivity and the properties they support, see IBM Watson Data API Data Flows Service - Data Asset and Connection Properties.

Creating a data flow

The following example shows how to create a data flow that reads data from a table on IBM Db2 Warehouse on Cloud (previously called IBM dashDB), filters the data, and writes the data to a data asset in the project. The data flow created for this example contains a linear pipeline, although in the general case, the pipeline forms a directed acyclic graph (DAG).

Environments

Begin by creating a connection to an existing IBM Db2 Warehouse on Cloud instance to use as the source of the data flow. For further information on the connections service, see Connections.

Defining a source in a data flow

A data flow can contain one or more data sources. A data source is defined as a binding node in the data flow pipeline, which has one output and no inputs. The binding node must reference either a connection or a data asset. Depending on the type of connection or data asset, additional properties might also need to be specified. Refer to IBM Watson Data API Data Flows Service - Data Asset and Connection Properties to determine which properties are applicable for a given connection, and which of those are required. For IBM Db2 Warehouse on Cloud both select_statement and table_name are required, so you must include values for those in the data flow.

For the following example, reference the connection you created earlier. The binding node for the data flow's source is:

{
  "id": "source1",
  "type": "binding",
  "connection": {
    "properties": {
      "schema_name": "GOSALESHR",
      "table_name": "EMPLOYEE"
    },
    "ref": "85be3e09-1c71-45d3-8d5d-220d6a6ea850"
  },
  "outputs": [
    {
      "id": "source1Output"
    }
  ]
}

The outputs object declares the ID of the output port of this source as source1Output so that other nodes can read from it. You can see the schema and table name have been defined, and that the connection with ID 85be3e09-1c71-45d3-8d5d-220d6a6ea850 is being referenced.

Defining an operation in a data flow

A data flow can contain zero or more operations, with a typical operation having one or more inputs and one or more outputs. An operation input is linked to the output of a source or another operation. An operation can also have additional parameters which define how the operation performs its work. An operation is defined as an execution node in the data flow pipeline.

The following example creates a filter operation so that only rows with value greater than 2010-01-01 in the DATE_HIRED field are retained. The execution node for our filter operation is:

{  
  "id":"operation1",
  "type":"execution_node",
  "op":"com.ibm.wdp.transformer.FreeformCode",
  "parameters":{  
     "FREEFORM_CODE":"filter(DATE_HIRED>'2010-01-01*')"
  },
  "inputs":[  
     {  
        "id":"inputPort1",
        "links":[  
           {  
              "node_id_ref":"source1",
              "port_id_ref":"source1Output"
           }
        ]
     }
  ],
  "outputs":[  
     {  
        "id":"outputPort1"
     }
  ]
}

The inputs attribute declares an input port with ID inputPort1 which references the output port of the source node (node ID source1 and port ID source1Output). The outputs attribute declares the ID of the output port of this operation as outputPort1 so that other nodes can read from it. For this example, the operation is defined as a freeform operation, denoted by the op attribute value of com.ibm.wdp.transformer.FreeformCode. A freeform operation has only a single parameter named FREEFORM_CODE whose value is a snippet of Sparklyr code. In this snippet of code, a filter function is called with the arguments to retain only those rows with value greater than 2010-01-01 in the DATE_HIRED field.

Defining a target in a data flow

A data flow can contain zero or more targets. A target is defined as a binding node in the data flow pipeline which has one input and no outputs. As with the source, the binding node must reference either a connection or a data asset. When using a data asset as a target, specify either the ID or name of an existing data asset.

In the following example, a data asset is referenced by its name. The binding node for the data flow's target is:

{
  "id": "target1",
  "type": "binding",
  "data_asset": {
    "properties": {
      "name": "my_shapedFile.csv"
    }
  },
  "inputs": [
    {
      "links": [
        {
          "node_id_ref": "operation1",
          "port_id_ref": "outputPort1"
        }
      ],
      "id": "target1Input"
    }
  ]
}

The inputs object declares an input port with ID target1Input which references the output port of our operation node (node ID operation1 and port ID outputPort1). The name of the data asset to create or update is specified as my_shapedFile.csv. Unless otherwise specified, this data asset is assumed to be in the same catalog or project as that which contains the data flow.
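
The source, operation, and target nodes can be assembled into the pipeline envelope that POST /v2/data_flows expects. link and make_pipeline are illustrative helpers mirroring the payload shape shown in Creating the data flow below:

```python
def link(from_node, from_port, input_id):
    """Create an input port that reads from another node's output port."""
    return {
        "id": input_id,
        "links": [{"node_id_ref": from_node, "port_id_ref": from_port}],
    }

def make_pipeline(nodes, pipeline_id="pipeline1"):
    """Wrap a list of binding/execution nodes in the pipeline envelope."""
    return {
        "doc_type": "pipeline",
        "version": "2.0",
        "primary_pipeline": pipeline_id,
        "pipelines": [{"id": pipeline_id, "nodes": nodes}],
    }
```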

Defining a parameterized property in a data flow

Properties contained within a data flow can be parameterized, allowing the values associated with the property to be replaced at run time. The paths referencing the parameterized properties are contained within the external parameters of the data flow pipeline. The paths can be defined as RFC 6902 paths; paths that contain the ID of the object within the array are also supported. So instead of:

/entity/pipelines/0/nodes/0/connection/table_name

you could also use:

/entity/pipelines/<pipeline_id>/nodes/<node_id>/connection/table_name

Any external parameters that are defined as being required must be reconciled when the data flow is run. Any external parameters that are defined as not being required and that are not reconciled when the data flow is run will default to using the property values already contained within the data flow.

In the following example, the external parameter references a filter property within the data flow that can be reconciled when the data flow is run. The external parameters for the data flow's pipeline are:

[
  {
    "name": "freeform_update",
    "required": false,
    "paths": [
      "/entity/pipeline/pipelines/pipeline1/nodes/operation1/parameters/FREEFORM_CODE"
    ]
  }
]

Creating the data flow

Putting it all together, you can now call the API to create the data flow with the following POST method:

POST /v2/data_flows

The new data flow can be stored in a catalog or project. Use either the catalog_id or project_id query parameter, depending on where you want to store the data flow asset. An example request to create a data flow is shown below:

POST /v2/data_flows?project_id=ff1ab70b-0553-409a-93f9-ccc31471c218

Request payload:

{
  "name": "my_dataflow",
  "pipeline": {
    "doc_type": "pipeline",
    "version": "2.0",
    "primary_pipeline": "pipeline1",
    "pipelines": [
      {
        "id": "pipeline1",
        "nodes": [
          {
            "id": "source1",
            "type": "binding",
            "connection": {
              "properties": {
                "schema_name": "GOSALESHR",
                "table_name": "EMPLOYEE"
              },
              "ref": "85be3e09-1c71-45d3-8d5d-220d6a6ea850"
            },
            "outputs": [
              {
                "id": "source1Output"
              }
            ]
          },
          {
            "id": "operation1",
            "type": "execution_node",
            "op": "com.ibm.wdp.transformer.FreeformCode",
            "parameters": {
              "FREEFORM_CODE": "filter(DATE_HIRED>'2010-01-01*')"
            },
            "inputs": [
              {
                "id": "inputPort1",
                "links": [
                  {
                    "node_id_ref": "source1",
                    "port_id_ref": "source1Output"
                  }
                ]
              }
            ],
            "outputs": [
              {
                "id": "outputPort1"
              }
            ]
          },
          {
            "id": "target1",
            "type": "binding",
            "data_asset": {
              "properties": {
                "name": "my_shapedFile.csv"
              }
            },
            "inputs": [
              {
                "links": [
                  {
                    "node_id_ref": "operation1",
                    "port_id_ref": "outputPort1"
                  }
                ],
                "id": "target1Input"
              }
            ]
          }
        ],
        "runtime_ref": "runtime1"
      }
    ],
    "runtimes": [
      {
        "name": "Spark",
        "id": "runtime1"
      }
    ],
    "external_parameters": [
      {
        "name": "freeform_update",
        "required": false,
        "paths": [
          "/entity/pipeline/pipelines/pipeline1/nodes/operation1/parameters/FREEFORM_CODE"
        ]
      }
    ]
  }
}

The response contains a data flow ID, which you will need later to run the data flow you created.
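The create call above can be sketched in Python using only the standard library. The helper names are illustrative, not part of the API; the token and payload are assumed to come from earlier steps.

```python
import json
import urllib.request

API_HOST = "https://api.dataplatform.cloud.ibm.com"

def data_flows_url(project_id):
    """Build the URL for creating a data flow in a project."""
    return f"{API_HOST}/v2/data_flows?project_id={project_id}"

def create_data_flow(token, project_id, payload):
    """POST the data flow payload and return the parsed JSON response.
    The response's metadata.asset_id is the data flow ID used by later calls."""
    req = urllib.request.Request(
        data_flows_url(project_id),
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For a catalog-stored data flow, the same sketch applies with a catalog_id query parameter in place of project_id.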

Working with data flow runs

What is a data flow run?

Each time a data flow is run, a new data flow run asset is created and stored in the project or catalog to record this event. This asset stores detailed metrics such as how many rows were read and written, a copy of the data flow that was run, and any logs from the engine. During a run, the information in the asset is updated to reflect the current state of the run. When the run completes (successfully or not), the information in the asset is updated one final time. If and when the data flow is deleted, any run assets of that data flow are also deleted.

As part of a data flow run, you can specify runtime values specific to that particular run, which override any parameterized properties defined when the data flow was created.

There are four components of a data flow run, which are accessible using different APIs.

  • Summary (GET /v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}). A quick, at-a-glance view of a run with a summary of how many rows in total were read and written.
  • Detailed metrics (GET /v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}/metrics). Detailed metrics for each binding node in the data flow (that is, the sources and targets).
  • Data flow (GET /v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}/origin). A copy of the data flow that was run at that point in time. (Remember that data flows can be modified between runs.)
  • Logs (GET /v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}/logs). The logs from the engine, which are useful for diagnosing run failures.
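The four endpoints share a common prefix, so a small helper (hypothetical, for illustration) can derive all of them from the data flow ID and run ID:

```python
def run_urls(data_flow_id, run_id):
    """Return the four per-run endpoint paths described above."""
    base = f"/v2/data_flows/{data_flow_id}/runs/{run_id}"
    return {
        "summary": base,               # at-a-glance totals
        "metrics": base + "/metrics",  # per-binding-node detail
        "origin": base + "/origin",    # copy of the data flow as run
        "logs": base + "/logs",        # engine logs
    }
```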
Run state life cycle

A data flow run has a defined life cycle, which is shown by its state attribute. The state attribute can have one of the following values:

  • starting The run was created but has not yet been submitted to the engine.
  • queued The run was submitted to the engine and is pending.
  • running The run is currently in progress.
  • finished The run completed successfully.
  • error The run did not complete. An error occurred either before the run was sent to the engine or while the run was in progress.
  • stopping The run was canceled but is still running.
  • stopped The run was canceled and is no longer in progress.

The run states that define phases of progress are: starting, queued, running, stopping. The run states that define states of completion are: finished, error, stopped.

The following are typical state transitions you would expect to see:

  1. The run completed successfully: starting -> queued -> running -> finished.
  2. The run failed (for example, connection credentials were incorrect): starting -> queued -> running -> error.
  3. The run could not be sent to the engine (for example, the connection referenced does not exist): starting -> error.
  4. The run was stopped (for example, at the user's request): starting -> queued -> running -> stopping -> stopped.
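A client that waits for a run to finish can poll the run summary until the state reaches one of the completion values. A sketch under those assumptions (the polling interval and helper names are illustrative; get_run_summary stands in for an authenticated GET of the run):

```python
import time

# Completion states, as listed above.
TERMINAL_STATES = {"finished", "error", "stopped"}

def is_terminal(state):
    """True once a run has reached a completion state."""
    return state in TERMINAL_STATES

def wait_for_run(get_run_summary, interval_secs=5, max_polls=120):
    """Poll a zero-argument callable returning the run summary JSON
    until entity.state is terminal; return the final state."""
    for _ in range(max_polls):
        state = get_run_summary()["entity"]["state"]
        if is_terminal(state):
            return state
        time.sleep(interval_secs)
    raise TimeoutError("run did not complete in time")
```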
Run a data flow

To run a data flow, call the following POST API:

POST /v2/data_flows/{data_flow_id}/runs?project_id={project_id}

The value of data_flow_id is the metadata.asset_id from your data flow. An example response from this API call might be:

{
    "metadata": {
        "asset_id": "ed09488c-6d51-48c4-b190-7096f25645d5",
        "asset_type": "data_flow_run",
        "create_time": "2017-12-21T10:51:47.000Z",
        "creator": "demo_dataflow_user@mailinator.com",
        "href": "https://api.dataplatform.cloud.ibm.com/v2/data_flows/cfdacdb4-3180-466f-8d4c-be7badea5d64/runs/ed09488c-6d51-48c4-b190-7096f25645d5?project_id=ff1ab70b-0553-409a-93f9-ccc31471c218",
        "project_id": "ff1ab70b-0553-409a-93f9-ccc31471c218",
        "usage": {
            "last_modification_time": "2017-12-21T10:51:47.923Z",
            "last_modifier": "demo_dataflow_user@mailinator.com",
            "last_access_time": "2017-12-21T10:51:47.923Z",
            "last_accessor": "demo_dataflow_user@mailinator.com",
            "access_count": 0
        }
    },
    "entity": {
        "data_flow_ref": "cfdacdb4-3180-466f-8d4c-be7badea5d64",
        "name": "my_dataflow",
        "rov": {
            "mode": 0,
            "members": []
        },
        "state": "starting",
        "tags": []
    }
}
Creating a parameter set

A data flow can be run with parameter replacements that reference a previously created parameter set.

Each parameter is contained within a parameter set. A parameter can be of type string, object, array, boolean, or integer, and its value must conform to the specified type.

To create a parameter set, call the following POST API:

POST /v2/data_flows/parameter_sets?project_id={project_id}

Request payload:

{
  "name": "my_parameter_set",
  "parameters": [
    {
      "name": "TheTableName",
      "literal_value": {
        "type": "string",
        "value": "Employee"
      }
    },
    {
      "name": "param2",
      "literal_value": {
        "type": "object",
        "value": {
          "type": "string",
          "value": "Test Value"
        }
      }
    },
    {
      "name": "param3",
      "literal_value": {
        "type": "boolean",
        "value": true
      }
    },
    {
      "name": "param4",
      "literal_value": {
        "type": "array",
        "value": [
          "string1",
          "string2"
        ]
      }
    },
    {
      "name": "param5",
      "literal_value": {
        "type": "integer",
        "value": 1
      }
    }
  ]
}
Run a data flow with parameter replacement

At runtime, parameter replacement properties can be included in the request body. These properties are specific to that particular run, and are used to replace the values of the parameterized properties defined when the data flow was created. A parameter replacement property can be either a reference to an existing parameter within a stored parameter set, or a straightforward replacement object defined as a literal value.

Each parameter replacement defines a name, which is matched with the name of an external parameter defined in the data flow. Once the association is made, the runtime value replaces the default value currently contained within the data flow.

Note that the stored data flow is left unchanged; the values are overridden only for that particular run.

To run a data flow with parameter replacement, call the following POST API:

POST /v2/data_flows/{data_flow_id}/runs?project_id={project_id}

Request payload:

{
  "param_replacements": [
    {
      "reference_value": {
        "parameter_set_ref": "6a750da0-7dc4-427a-b35d-939bb5be87f5",
        "parameter_set_param_name": "TheTableName"
      },
      "name": "table_name_update"
    },
    {
      "literal_value": {
        "value": "filter(DATE_HIRED>'2018-01-01*')"
      },
      "name": "freeform_update"
    }
  ]
}

The value of data_flow_id is the metadata.asset_id from your data flow.
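The two kinds of replacement entries in the payload above can be built with small helpers (the function names are illustrative, not part of the API):

```python
def set_ref_replacement(name, parameter_set_ref, param_name):
    """Replacement that points at a parameter inside a stored parameter set."""
    return {
        "name": name,
        "reference_value": {
            "parameter_set_ref": parameter_set_ref,
            "parameter_set_param_name": param_name,
        },
    }

def literal_replacement(name, value):
    """Replacement with an inline literal value."""
    return {"name": name, "literal_value": {"value": value}}

# Rebuild the example request payload shown above.
payload = {
    "param_replacements": [
        set_ref_replacement("table_name_update",
                            "6a750da0-7dc4-427a-b35d-939bb5be87f5",
                            "TheTableName"),
        literal_replacement("freeform_update",
                            "filter(DATE_HIRED>'2018-01-01*')"),
    ]
}
```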

An example response from this API call might be:

{
    "metadata": {
        "asset_id": "ed09488c-6d51-48c4-b190-7096f25645d5",
        "asset_type": "data_flow_run",
        "create_time": "2017-12-21T10:51:47.000Z",
        "creator": "demo_dataflow_user@mailinator.com",
        "href": "https://api.dataplatform.cloud.ibm.com/v2/data_flows/cfdacdb4-3180-466f-8d4c-be7badea5d64/runs/ed09488c-6d51-48c4-b190-7096f25645d5?project_id=ff1ab70b-0553-409a-93f9-ccc31471c218",
        "project_id": "ff1ab70b-0553-409a-93f9-ccc31471c218",
        "usage": {
            "last_modification_time": "2017-12-21T10:51:47.923Z",
            "last_modifier": "demo_dataflow_user@mailinator.com",
            "last_access_time": "2017-12-21T10:51:47.923Z",
            "last_accessor": "demo_dataflow_user@mailinator.com",
            "access_count": 0
        }
    },
    "entity": {
        "data_flow_ref": "cfdacdb4-3180-466f-8d4c-be7badea5d64",
        "name": "my_dataflow",
        "rov": {
            "mode": 0,
            "members": []
        },
        "state": "starting",
        "tags": []
    }
}
Get a data flow run summary

To retrieve the latest summary of a data flow run, call the following GET method:

GET /v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}?project_id={project_id}

The value of data_flow_id is the metadata.asset_id from your data flow. The value of data_flow_run_id is the metadata.asset_id from your data flow run. An example response from this API call might be:

{
    "metadata": {
        "asset_id": "ed09488c-6d51-48c4-b190-7096f25645d5",
        "asset_type": "data_flow_run",
        "create_time": "2017-12-21T10:51:47.000Z",
        "creator": "demo_dataflow_user@mailinator.com",
        "href": "https://api.dataplatform.cloud.ibm.com/v2/data_flows/cfdacdb4-3180-466f-8d4c-be7badea5d64/runs/ed09488c-6d51-48c4-b190-7096f25645d5?project_id=ff1ab70b-0553-409a-93f9-ccc31471c218",
        "project_id": "ff1ab70b-0553-409a-93f9-ccc31471c218",
        "usage": {
            "last_modification_time": "2017-12-21T10:51:47.923Z",
            "last_modifier": "demo_dataflow_user@mailinator.com",
            "last_access_time": "2017-12-21T10:51:47.923Z",
            "last_accessor": "demo_dataflow_user@mailinator.com",
            "access_count": 0
        }
    },
    "entity": {
        "data_flow_ref": "cfdacdb4-3180-466f-8d4c-be7badea5d64",
        "engine_state": {
            "session_cookie": "route=Spark; HttpOnly; Secure",
            "engine_run_id": "804d17bd-5ed0-4d89-ba38-ab7890d61e45"
        },
        "name": "my_dataflow",
        "rov": {
            "mode": 0,
            "members": []
        },
        "state": "finished",
        "summary": {
            "completed_date": "2018-01-03T16:58:05.726Z",
            "engine_elapsed_secs": 9,
            "engine_completed_date": "2018-01-03T16:58:05.360Z",
            "engine_started_date": "2018-01-03T16:57:56.211Z",
            "engine_status_date": "2018-01-03T16:58:05.360Z",
            "engine_submitted_date": "2018-01-03T16:57:46.044Z",
            "total_bytes_read": 95466,
            "total_bytes_written": 42142,
            "total_rows_read": 766,
            "total_rows_written": 336
        },
        "tags": []
    }
}
Troubleshooting a failed run

If a data flow run fails, the state attribute is set to the value error. In addition to this, the run asset itself has an attribute called error which is set to a concise description of the error (where available from the engine). If this information is not available from the engine, a more general message is set in the error attribute. This means that the error attribute is never left unset if a run fails. The following example shows the error payload produced if a schema specified in a source connection's properties doesn't exist:

{
    "error": {
        "trace": "1c09deb8-c3f9-4dc1-ad5a-0fc4e7c97071",
        "errors": [
            {
                "code": "runtime_failed",
                "message": "While the process was running a fatal error occurred in the engine (see logs for more details): SCAPI: CDICO2005E: Table could not be found: \"BADSCHEMAGOSALESHR.EMPLOYEE\" is an undefined name.. SQLCODE=-204, SQLSTATE=42704, DRIVER=4.20.4\ncom.ibm.connect.api.SCAPIException: CDICO2005E: Table could not be found: \"BADSCHEMAGOSALESHR.EMPLOYEE\" is an undefined name.. SQLCODE=-204, SQLSTATE=42704, DRIVER=4.20.4\n\tat com.ibm.connect.jdbc.JdbcInputInteraction.init(JdbcInputInteraction.java:158)\n\t...",
                "extra": {
                    "account": "2d0d29d5b8d2701036042ca4cab8b613",
                    "diagnostics": "[PROJECT_ID-ff1ab70b-0553-409a-93f9-ccc31471c218] [DATA_FLOW_ID-cfdacdb4-3180-466f-8d4c-be7badea5d64] [DATA_FLOW_NAME-my_dataflow] [DATA_FLOW_RUN_ID-ed09488c-6d51-48c4-b190-7096f25645d5]",
                    "environment_name": "ypprod",
                    "http_status": 400,
                    "id": "CDIWA0129E",
                    "source_cluster": "NULL",
                    "service_version": "1.0.471",
                    "source_component": "WDP-DataFlows",
                    "timestamp": "2017-12-19T19:52:09.438Z",
                    "transaction_id": "71c7d19b-a91b-40b1-9a14-4535d76e9e16",
                    "user": "demo_dataflow_user@mailinator.com"
                }
            }
        ]
    }
}

To get the logs produced by the engine, use the following API:

GET /v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}/logs?project_id={project_id}

Data Profiles

Introduction

Data profiles contain classification information and information about the distribution of your data, which helps you understand your data better and make appropriate data shaping decisions.

Data profiles are automatically created when a data set is added to a catalog with data policy enforcement. The profile summary helps you analyze your data more closely and decide which cleansing operations will provide the best results for your use case. You can also perform CRUD operations on data profiles for data sets in catalogs or projects without data policy enforcement.

Create a data profile

You can use this API to:

  • Create a data profile
  • Create and execute a data profile

To create a data profile for a data set in a specified catalog or project and not execute it, call the following POST method:

POST /v2/data_profiles?start=false

OR

POST /v2/data_profiles

To create a data profile for a data set in a specified catalog or project and execute it, call the following POST method:

POST /v2/data_profiles?start=true

The minimal request payload required to create a data profile is as follows:

{
    "metadata": {
        "dataset_id": "{DATASET_ID}",
        "catalog_id": "{CATALOG_ID}"
    }
}

OR

{
    "metadata": {
        "dataset_id": "{DATASET_ID}",
        "project_id": "{PROJECT_ID}"
    }
}

The request payload can have an entity part which is optional:

    {
        "metadata": {
            "dataset_id": "{DATASET_ID}",
            "catalog_id": "{CATALOG_ID}"
        },
        "entity": {
            "data_profile": {
                "options": {
                    "max_row_count": {MAX_ROW_COUNT_VALUE},
                    "max_distribution_size": {MAX_SIZE_OF_DISTRIBUTIONS},
                    "max_numeric_stats_bins": {MAX_NUMBER_OF_STATIC_BINS},
                    "classification_options": {
                        "disabled": {BOOLEAN_TO_ENABLE_OR_DISABLE_CLASSIFICATION_OPTIONS},
                        "class_codes": {DATA_CLASS_CODE},
                        "items": {ITEMS}
                    }
                }
            }
        }
    }

The following parameters can be specified in the URI and the payload:

  1. start: Specifies whether to start the profiling service immediately after the data profile is created. The default is false.

  2. max_row_count: Specifies the maximum number of rows to perform profiling on. If no value is provided or if the value is invalid (negative), the default is 5000 rows.

  3. row_percentage: Specifies the percentage of rows to perform profiling on. The value must be between 0 and 100.

  4. max_distribution_size: Specifies the maximum size of various distributions produced by the profiling process. If no value is provided, the default is 100.

  5. max_numeric_stats_bins: Specifies the maximum number of bins to use in the numerical statistics. If no bin size is provided, the default is 100 bins.

  6. classification_options: Specifies the various options available for classification.

    (i). disabled: If true, the classification options are disabled and default values are used.

    (ii). class_codes: Specifies the data class code to consider during profiling.

    (iii). items: Specifies the items.

    Note: You can get various data class codes through the data class service.

To create a data profile for a data set, the following steps must be completed:

  1. You must have a valid IAM token to make REST API calls and a project or catalog ID.

  2. You must have an IBM Cloud Object Storage bucket, which must be associated with your catalog in the project.

  3. The data set must be added to your catalog in the project.

  4. Construct a request payload to create a data profile with the values required in the payload.

  5. Send a POST request to create a data profile.

When you call the method, the payload is validated. If a required value is not specified or a value is invalid, you get a response message with an HTTP status code of 400 and information about the invalid or missing values.

The response of the method includes a location header with a value that indicates the location of the profile that was created. The response body also includes a field href which contains the location of the created profile.

The execution.status of the profile is none if the start parameter is not set or is set to false. Otherwise, it is in submitted state or any other state depending on the profiling execution status.
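The payload variants above can be assembled with a small helper. This is a sketch; the helper names and the in_project switch are illustrative, and the options object follows the optional entity shape shown earlier.

```python
def build_profile_payload(dataset_id, container_id, in_project=False,
                          options=None):
    """Minimal create-profile payload. Set in_project=True to use
    project_id instead of catalog_id; `options` is the optional
    entity.data_profile.options object."""
    key = "project_id" if in_project else "catalog_id"
    payload = {"metadata": {"dataset_id": dataset_id, key: container_id}}
    if options:
        payload["entity"] = {"data_profile": {"options": options}}
    return payload

def create_profile_url(start=True):
    """URL path for POST /v2/data_profiles with the start flag."""
    return f"/v2/data_profiles?start={'true' if start else 'false'}"
```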

The following are possible response codes for this API call:

Response HTTP status   Cause                   Possible scenarios
201                    Created                 A data profile was created.
400                    Bad Request             The request payload had invalid values, or invalid or unwanted parameters.
401                    Unauthorized            An invalid IAM token was provided in the request header.
403                    Forbidden               The user is not allowed to create a data profile.
500                    Internal Server Error   A runtime error occurred.

Get a data profile

To get a data profile for a data set in a specified catalog or project, call the following GET method:

GET /v2/data_profiles/{PROFILE_ID}?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}

OR

GET /v2/data_profiles/{PROFILE_ID}?project_id={PROJECT_ID}&dataset_id={DATASET_ID}

The value of PROFILE_ID is the value of metadata.guid from the successful response payload of the create data profile call.

For other runtime errors, you might get an HTTP status code of 500, indicating that profiling didn't finish as expected.

The following are possible response codes for this API call:

Response HTTP status   Cause                   Possible scenarios
200                    Success                 The data profile was created and executed.
202                    Accepted                The data profile was created and is under execution.
401                    Unauthorized            An invalid IAM token was provided in the request header.
403                    Forbidden               The user is not allowed to get the data profile.
404                    Not Found               The specified data profile was not found.
500                    Internal Server Error   A runtime error occurred.
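Because the GET returns 202 while profiling is still executing and 200 once it has finished, a client can branch on the status code. A minimal sketch (the helper name is illustrative):

```python
def profile_finished(status_code):
    """Interpret the status code of GET /v2/data_profiles/{PROFILE_ID}."""
    if status_code == 200:
        return True    # profile created and executed
    if status_code == 202:
        return False   # profile created, execution still in progress
    raise RuntimeError(f"unexpected status {status_code}")
```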

Update a data profile

To update a data profile for a data set in a specified catalog or project, call the following PATCH method:

PATCH /v2/data_profiles/{PROFILE_ID}?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}

OR

PATCH /v2/data_profiles/{PROFILE_ID}?project_id={PROJECT_ID}&dataset_id={DATASET_ID}

The value of PROFILE_ID is the value of metadata.guid from the successful response payload of the create data profile call.

The JSON request payload must be as follows:


    [
      {
        "op": "add",
        "path": "string",
        "from": "string",
        "value": {}
      }
    ]

During update, the entire data profile is replaced, apart from any read-only or response-only attributes.

If profiling processes are running and the start parameter is set to true, then a data profile is only updated if the stop_in_progress_runs parameter is set to true.

The updates must be specified by using the JSON patch format, described in RFC 6902.
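For example, an RFC 6902 patch document that replaces one of the profile options could be built like this. The path is an assumption based on the payload shape shown earlier, not taken from the API reference:

```python
def replace_op(path, value):
    """Single RFC 6902 'replace' operation."""
    return {"op": "replace", "path": path, "value": value}

# Hypothetical patch: change the row limit used by the profiling run.
patch = [replace_op("/entity/data_profile/options/max_row_count", 10000)]
```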

Modify asset level classification

This API is used for CRUD operations on asset level classification.

To modify the asset level classification details in the data_profile parameter for a data set in a specified catalog or project, call the following PATCH method:

PATCH /v2/data_profiles/classification?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}

OR

PATCH /v2/data_profiles/classification?project_id={PROJECT_ID}&dataset_id={DATASET_ID}

The JSON request payload must be structured in the following way:

    [
      {
        "op": "add",
        "path": "/data_classification",
        "value": [
            {
               "id":"{ASSET_LEVEL_CLASSIFICATION_ID}",
               "name":"{ASSET_LEVEL_CLASSIFICATION_NAME}"
            }
         ]
      }
    ]

The path attribute must be set exactly as shown in the previous JSON request payload; otherwise you will get a validation error with an HTTP status code of 400.

The values of ASSET_LEVEL_CLASSIFICATION_ID and ASSET_LEVEL_CLASSIFICATION_NAME can be: PII and PII details respectively.

The data updates must be specified by using the JSON patch format, described in RFC 6902 [https://tools.ietf.org/html/rfc6902]. For more details about JSON patch, see [http://jsonpatch.com].

A successful response has an HTTP status code of 200 and lists the asset level classifications.
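The patch body shown above can be generated with a small helper (the function name is illustrative):

```python
def classification_patch(class_id, class_name):
    """RFC 6902 'add' operation setting the asset level classification."""
    return [{
        "op": "add",
        "path": "/data_classification",
        "value": [{"id": class_id, "name": class_name}],
    }]
```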

The following are possible response codes for this API call:

Response HTTP status   Cause                   Possible scenarios
200                    Success                 The asset level classification was added to the asset.
400                    Bad Request             The request payload had invalid values, or invalid or unwanted parameters.
401                    Unauthorized            An invalid IAM token was provided in the request header.
403                    Forbidden               The user is not allowed to add an asset level classification to the asset.
500                    Internal Server Error   A runtime error occurred.

Delete a data profile

To delete a data profile for a data set in a specified catalog or project, call the following DELETE method:

DELETE /v2/data_profiles/{PROFILE_ID}?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}&stop_in_progress_profiling_runs=false

OR

DELETE /v2/data_profiles/{PROFILE_ID}?project_id={PROJECT_ID}&dataset_id={DATASET_ID}&stop_in_progress_profiling_runs=true

The value of PROFILE_ID is the value of metadata.guid from the successful response payload of the create data profile call.

You can't delete a profile if the profiling execution status is in running state and the query parameter stop_in_progress_profiling_runs is set to false.

A successful response has an HTTP status code of 204.

Troubleshooting profiling failures

If an API endpoint fails and the error message doesn't pinpoint what went wrong (most commonly with an HTTP status code of 500, Internal Server Error), you can retrieve the profiling data flow run logs and examine each step that ran behind the scenes.

A common cause is that the profiling data flow could not connect to the sources or targets with the connection information specified in the request payload. From a profiling perspective, this means that either the connection was not created for the catalog or project, or the attachment for the data set has inconsistent interaction properties (in the case of a remote attachment).

To get the profiling data flow run logs, call the following GET method:

GET /v2/data_flows/{DATA_FLOW_ID}/runs/{DATA_FLOW_RUN_ID}/logs?catalog_id={CATALOG_ID}

OR

GET /v2/data_flows/{DATA_FLOW_ID}/runs/{DATA_FLOW_RUN_ID}/logs?project_id={PROJECT_ID}

The values of DATA_FLOW_ID and DATA_FLOW_RUN_ID are in the response payload of the GET profile call, at the paths entity.data_profile.execution.dataflow_id and entity.data_profile.execution.dataflow_run_id respectively.

The response to the GET method includes information about each log event, including the event time, message type, and message text.

A maximum of 100 logs is returned per page. To specify a lower limit, use the limit query parameter with an integer value. More logs than those on the first page might be available. To get the next page, call a GET method using the value of the next.href member from the response payload.
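The paging scheme above can be sketched as a generator. The fetch function is a stand-in for an authenticated GET that returns the parsed JSON body, and the "logs" key on each page is an assumption about the response shape:

```python
def iter_log_pages(fetch_json, first_url):
    """Yield each page of log events, following next.href until the
    response no longer contains one."""
    url = first_url
    while url:
        page = fetch_json(url)
        yield page.get("logs", [])
        url = page.get("next", {}).get("href")
```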

Stream Flows

Introduction

The streams flow service provides APIs to create, update, delete, list, start, and stop stream flows.

A streams flow is a continuous flow of massive volumes of moving data that real-time analytics can be applied to. A streams flow can read data from a variety of sources, process that data by using analytic operations or your custom code, and then write it to one or more targets. You can access and analyze massive amounts of changing data as it is created. Regardless of whether the data is structured or unstructured, you can leverage data at scale to drive real-time analytics for up-to-the-minute business decisions.

The sources that are supported include Kafka, Message Hub, MQTT, and Watson IoT. Targets that are supported include Db2 Warehouse on Cloud, Cloud Object Storage, and Redis. Analytic operators that are supported include Aggregation, Python Machine Learning, Code, and Geofence.

Authorization

Authorization is done via an Identity and Access Management (IAM) bearer token. All API calls require this bearer token in the request header.
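As described in the introduction, the token is obtained by exchanging an IBM Cloud API key at the IAM token endpoint. A minimal stdlib sketch (the helper names are illustrative; tokens expire after about an hour, so refresh as needed):

```python
import json
import urllib.parse
import urllib.request

IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"

def token_request_body(api_key):
    """Form-encoded body for the IAM API-key grant."""
    return urllib.parse.urlencode({
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key,
    })

def get_bearer_token(api_key):
    """Exchange an API key for an IAM access token."""
    req = urllib.request.Request(
        IAM_TOKEN_URL,
        data=token_request_body(api_key).encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]
```

The returned token is then sent as `Authorization: Bearer <token>` on every streams flow call.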

Create a Streams Flow

1. Streaming Analytics instance ID

The streams flow is submitted to a Streaming Analytics service for compilation and running. When creating a flow, the Streaming Analytics instance ID must be provided. The instance ID can be found in the service credentials, which can be accessed from the service dashboard.

2. The pipeline graph

A streams flow represents its sources, targets, and operations in a pipeline graph. The pipeline graph can be generated by choosing the relevant operators in the Streams Designer canvas. To retrieve a pipeline graph created by the Streams Designer, use:

GET /v2/streams_flows/85be3e09-1c71-45d3-8d5d-220d6a6ea850?project_id=ff1ab70b-0553-409a-93f9-ccc31471c218

This will return a streams flow containing a pipeline field in the entity. This pipeline object can be copied and submitted into another flow via:

POST /v2/streams_flows/?project_id=ff1ab70b-0553-409a-93f9-ccc31471c218

Request Payload:

{
  "name": "My Streams Flow",
  "description": "A Sample Streams Flow.",
  "engines": {
      "streams": {
        "instance_id": "8ff81caa-1076-41ce-8de1-f4fe8d79e30e"
      }
  },
  "pipeline": {
        "doc_type": "pipeline",
        "version": "1.0",
        "json_schema": "http://www.ibm.com/ibm/wdp/flow-v1.0/pipeline-flow-v1-schema.json",
        "id": "",
        "app_data": {
            "ui_data": {
                "name": "mqtt 2"
             }
         },
         "primary_pipeline": "primary-pipeline",
         "pipelines": [
          {
             "id": "primary-pipeline",
             "runtime": "streams",
             "nodes": [
             {
                 "id": "messagehubsample_29xse4zvabe",
                 "type": "binding",
                 "op": "ibm.streams.sources.messagehubsample",
                 "outputs": [
                     {
                         "id": "target",
                         "schema_ref": "schema0",
                         "links": [
                         {
                           "node_id_ref": "mqtt_o6are9c4f",
                           "port_id_ref": "source"
                         }
                       ]
                     }
                   ],
                   "parameters": {
                     "schema_mapping": [
                       {
                         "name": "time_stamp",
                         "type": "timestamp",
                         "path": "/time_stamp"
                       },
                       {
                         "name": "customerId",
                         "type": "double",
                         "path": "/customerId"
                       },
                       {
                         "name": "latitude",
                         "type": "double",
                         "path": "/latitude"
                       },
                       {
                         "name": "longitude",
                         "type": "double",
                         "path": "/longitude"
                       }
                     ]
                   },
                   "connection": {
                     "ref": "EXAMPLE_MESSAGE_HUB_CONNECTION",
                     "project_ref": "EXAMPLE",
                     "properties": {
                       "asset": {
                         "path": "/geofenceSampleData",
                         "type": "topic",
                         "name": "Geospatial data",
                         "id": "geofenceSampleData"
                       }
                     }
                   },
                   "app_data": {
                     "ui_data": {
                       "label": "Sample Data",
                       "x_pos": 60,
                       "y_pos": 90
                     }
                   }
                 },
                 {
                   "id": "mqtt_o6are9c4f",
                   "type": "binding",
                   "op": "ibm.streams.targets.mqtt",
                   "parameters": {},
                   "connection": {
                     "ref": "cd5388c3-b203-4c77-803b-bc902d864a30",
                     "project_ref": "a912d673-54d3-4e5c-800f-5088554d3aa8",
                     "properties": {
                       "asset": "t"
                    }
                  },
                  "app_data": {
                    "ui_data": {
                      "label": "MQTT",
                      "x_pos": 420,
                      "y_pos": 90
                    }
                  }
                },
                {
                  "id": "mqtt_y84zc3vfche",
                  "type": "binding",
                  "op": "ibm.streams.sources.mqtt",
                  "outputs": [
                    {
                      "id": "target",
                      "schema_ref": "schema1",
                      "links": [
                        {
                           "node_id_ref": "debug_9avg3zdig25",
                           "port_id_ref": "source"
                         }
                       ]
                     }
                   ],
                   "parameters": {
                   "schema_mapping": [
                      {
                        "name": "time_stamp",
                        "type": "timestamp",
                        "path": "/time_stamp"
                      },
                      {
                        "name": "customerId",
                        "type": "double",
                        "path": "/customerId"
                      },
                      {
                        "name": "latitude",
                        "type": "double",
                        "path": "/latitude"
                      },
                      {
                        "name": "longitude",
                        "type": "double",
                        "path": "/longitude"
                      }
                    ]
                  },
                  "connection": {
                    "ref": "cd5388c3-b203-4c77-803b-bc902d864a30",
                    "project_ref": "a912d673-54d3-4e5c-800f-5088554d3aa8",
                    "properties": {
                      "asset": "t"
                    }
                  },
                  "app_data": {
                    "ui_data": {
                      "label": "MQTT",
                      "x_pos": -120,
                      "y_pos": -210
                    }
                  }
                },
                {
                  "id": "debug_9avg3zdig25",
                  "type": "binding",
                  "op": "ibm.streams.targets.debug",
                  "parameters": {},
                  "app_data": {
                    "ui_data": {
                      "label": "Debug",
                      "x_pos": 240,
                      "y_pos": -270
                    }
                  }
                }
              ]
            }
         ],
         "schemas": [
            {
              "id": "schema0",
              "fields": [
                 {
                      "name": "time_stamp",
                      "type": "timestamp"
                 },
                 {
                      "name": "customerId",
                      "type": "double"
                 },
                 {
                      "name": "latitude",
                      "type": "double"
                 },
                 {
                     "name": "longitude",
                     "type": "double"
                 }
             ]
           },
           {
             "id": "schema1",
             "fields": [
               {
                 "name": "time_stamp",
                 "type": "timestamp"
               },
               {
                 "name": "customerId",
                 "type": "double"
               },
               {
                 "name": "latitude",
                 "type": "double"
               },
               {
                 "name": "longitude",
                 "type": "double"
               }
             ]
           }
        ]
     }
}

Streams Flow Lifecycle

After a streams flow is created, it is in the STOPPED state until it is submitted as a job to be started. When a job is started, a Cloudant asset is created to track the status of the streams flow run. The start job operation can take up to a minute to complete, during which time the streams flow is in the STARTING state. After the submission and compilation have completed, the streams flow is in the RUNNING state.

To change the run state, use the POST API:

POST /v2/streams_flows/85be3e09-1c71-45d3-8d5d-220d6a6ea850/runs?project_id=ff1ab70b-0553-409a-93f9-ccc31471c218

Request Payload:

{
   "state": "started",
   "allow_streams_start": true
}
  • To start the streams flow run, use { "state": "started" }. To stop the run, use { "state": "stopped" }.

  • Set "allow_streams_start" to true to start the Streaming Analytics service in the event that it is stopped.

The start job operation triggers a long-running process on the Streaming Analytics service instance. During this time, the progress and status of the job can be viewed:

GET https://api.dataplatform.cloud.ibm.com/v2/streams_flows/85be3e09-1c71-45d3-8d5d-220d6a6ea850/runs?project_id=ff1ab70b-0553-409a-93f9-ccc31471c218

A version of the deployed pipeline is saved to represent the Runtime Pipeline. The streams flow can still be edited in the Streams Designer without affecting the deployed Runtime Pipeline; the changes take effect only when the user stops the running flow and starts it again.
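
As a sketch, the start request above can be assembled programmatically. The helper below is hypothetical (not part of the API); sending the request, with an IAM bearer token in the Authorization header, is left to your HTTP client of choice:

```python
import json

# Hypothetical helper: assembles the start-run request described above.
# Issuing the call and supplying the IAM bearer token are left to the
# caller's HTTP client.
def start_run_request(flow_id, project_id, allow_streams_start=True):
    url = ("https://api.dataplatform.cloud.ibm.com"
           f"/v2/streams_flows/{flow_id}/runs?project_id={project_id}")
    body = json.dumps({
        "state": "started",
        "allow_streams_start": allow_streams_start,
    })
    return url, body
```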

Metadata Discovery

Metadata Discovery can be used to automatically discover assets from a connection. The connection used for a discovery run can be associated with a catalog or project, but new data assets will be created in a project. Each asset that is discovered from a connection is added as a data asset to the project.

For a list of the supported types of connections against which the Metadata Discovery service can be invoked, see Discover data assets from a connection.

In general, the discovery process takes a significant amount of time. Therefore, the API to create a discovery run only queues the run and returns immediately (typically before the discovery run even starts). You can then call other APIs to monitor the progress of the discovery run (see Monitoring a metadata discovery run and Retrieving discovered assets).

The following example shows a request to create a metadata discovery run. It assumes that a project, a connection, and a catalog have already been created, and that their IDs are known by the caller. If a catalog is provided (as in the following example), the connection is associated with the catalog. If no catalog is provided, the connection is associated with the project.

Note: In the following examples, the discovered assets are found in a connection to a DB2 database, but the details of the database are hidden within the connection. So, the caller of the data_discoveries API specifies the database to discover indirectly via the connection.

API request - Create discovery run:

POST /v2/data_discoveries

Request payload:

{
    "entity": {
        "catalog_id": "816882fa-dcda-46e1-8c6b-fa23c3cbad14",
        "connection_id": "f638398f-fcc7-4856-b78d-5c8efa5b9282",
        "project_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590"
    }
}

In the example request payload, you can see the ID of the connection whose assets will be discovered, and the ID of the project into which the newly created assets will be added.

Response payload:

{
    "metadata": {
        "id": "dcb8a234ad5e438d904a4cdbe0ba70e2",
        "invoked_by": "IBMid-50S...",
        "bss_account_id": "e348e...",
        "created_at": "2018-06-22T15:42:02.843Z"
    },
    "entity": {
        "status": "CREATED",
        "connection_id": "f638398f-fcc7-4856-b78d-5c8efa5b9282",
        "catalog_id": "816882fa-dcda-46e1-8c6b-fa23c3cbad14",
        "project_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590"
    }
}

In the response, you can see that the discovery run was created with the ID dcb8a234ad5e438d904a4cdbe0ba70e2, which you'll need if you want to get the status of the discovery run that you just created. Also shown in the response are:

  • invoked_by: the IAM ID of the account that kicked off the discovery process
  • bss_account_id: the BSS account ID of the catalog
  • created_at: the creation date and time of the discovery job

To get the status of a discovery run, use the GET data_discoveries API. You can request the status of a discovery run as often as desired. The following sections show a few such calls to illustrate the progression of a discovery run.

API Request - Get status of discovery run:

GET /v2/data_discoveries/dcb8a234ad5e438d904a4cdbe0ba70e2

There is no request payload for the previous GET data_discoveries request. Instead, the ID of the discovery run whose status is being requested is supplied as a path parameter. In the previous URL, use the discovery run ID that was returned by the earlier call to POST data_discoveries. If you no longer have access to the ID of the discovery run for which you want to see status information, see the section Call Discovery API to get the ID of a metadata discovery run.

The following examples show various responses to the same GET data_discoveries monitor request previously shown, made at various points during the discovery run.

Response to status request immediately after creation of a discovery run:

{
    "metadata": {
        "id": "dcb8a234ad5e438d904a4cdbe0ba70e2",
        "invoked_by": "IBMid-50S...",
        "bss_account_id": "e348e...",
        "created_at": "2018-06-22T15:42:02.843Z"
    },
    "entity": {
        "status": "CREATED",
        "connection_id": "f638398f-fcc7-4856-b78d-5c8efa5b9282",
        "catalog_id": "816882fa-dcda-46e1-8c6b-fa23c3cbad14",
        "project_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590"
    }
}

In the previous response, you can see that the status of the discovery run has not yet changed: it is still CREATED. This is because the request to discover assets is put into a queue and is initiated in the order in which it was received.

Response to status request immediately after a discovery run has actually started:

{
    "metadata": {
        "id": "dcb8a234ad5e438d904a4cdbe0ba70e2",
        "invoked_by": "IBMid-50S...",
        "bss_account_id": "e348e...",
        "created_at": "2018-06-22T15:42:02.843Z",
        "started_at": "2018-06-22T15:42:06.167Z",
        "ref_project_connection_id": "2526ed95-dedd-4904-bb31-c06d9cb1e105"
    },
    "entity": {
        "statistics": {

        },
        "status": "RUNNING",
        "connection_id": "f638398f-fcc7-4856-b78d-5c8efa5b9282",
        "catalog_id": "816882fa-dcda-46e1-8c6b-fa23c3cbad14",
        "project_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590"
    }
}

Now notice that the status has changed to RUNNING, which indicates that the discovery process has started. Also, the metadata field has some additional fields:

  • started_at: the date and time at which the discovery run started
  • ref_project_connection_id: a reference to a cloned project connection ID, internally set when a discovery is created for a connection in a catalog

In addition, notice that a new statistics object was introduced into the response body. In this response, the object is empty because the discovery run, which has only just started, hasn't yet discovered any assets.

Response to status request after some assets were discovered:

{
    "metadata": {
        "id": "dcb8a234ad5e438d904a4cdbe0ba70e2",
        "invoked_by": "IBMid-50S...",
        "bss_account_id": "e348e...",
        "created_at": "2018-06-22T15:42:02.843Z",
        "started_at": "2018-06-22T15:42:06.167Z",
        "discovered_at": "2018-06-22T15:42:27.970Z",
        "ref_project_connection_id": "2526ed95-dedd-4904-bb31-c06d9cb1e105"
    },
    "entity": {
        "statistics": {
            "discovered": 128,
            "submit_succ": 128
        },
        "status": "RUNNING",
        "connection_id": "f638398f-fcc7-4856-b78d-5c8efa5b9282",
        "catalog_id": "816882fa-dcda-46e1-8c6b-fa23c3cbad14",
        "project_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590"
    }
}

Notice the statistics object now contains two fields:

  • discovered: the number of assets discovered so far during the discovery run
  • submit_succ: the number of assets successfully submitted for creation so far during the discovery run. A discovered asset goes through an internal pipeline with various stages from being discovered at the connection to being created in the project. Here, submitted means the asset was submitted to the internal pipeline.

Refer to the Watson Data API schema for the complete list of fields that can appear in the statistics object.

Because the discovery run isn't yet finished, the status in the previous response is still RUNNING.
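
Rather than issuing these GET calls by hand, a small loop can poll until the run reaches a terminal state. A minimal sketch, assuming a caller-supplied fetch_status function that wraps the GET data_discoveries call and returns the parsed response:

```python
import time

def wait_for_discovery(fetch_status, poll_seconds=5.0, timeout=600.0):
    """Poll a discovery run until it reaches a terminal state.

    fetch_status is a caller-supplied function (for example, wrapping
    GET /v2/data_discoveries/{id}) that returns the parsed response
    dict. COMPLETED and ABORTED are treated as terminal states, per
    the status values shown in this section.
    """
    deadline = time.monotonic() + timeout
    while True:
        response = fetch_status()
        status = response["entity"]["status"]
        if status in ("COMPLETED", "ABORTED"):
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(f"discovery still {status} after {timeout}s")
        time.sleep(poll_seconds)
```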

Response to status request after the discovery run was completed:

{
    "metadata": {
        "id": "dcb8a234ad5e438d904a4cdbe0ba70e2",
        "invoked_by": "IBMid-50S...",
        "bss_account_id": "e348e...",
        "created_at": "2018-06-22T15:42:02.843Z",
        "started_at": "2018-06-22T15:42:06.167Z",
        "discovered_at": "2018-06-22T15:42:27.970Z",
        "processed_at": "2018-06-22T15:42:45.877Z",
        "finished_at": "2018-06-22T15:43:14.969Z",
        "ref_project_connection_id": "2526ed95-dedd-4904-bb31-c06d9cb1e105"
    },
    "entity": {
        "statistics": {
            "discovered": 179,
            "submit_succ": 179,
            "create_succ": 179
        },
        "status": "COMPLETED",
        "connection_id": "f638398f-fcc7-4856-b78d-5c8efa5b9282",
        "catalog_id": "816882fa-dcda-46e1-8c6b-fa23c3cbad14",
        "project_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590"
    }
}

Notice the status field has changed to COMPLETED to indicate that the discovery run is finished. Other response fields to note:

  • finished_at: the date and time at which the discovery run finished
  • discovered: indicates that 179 assets were discovered at the connection
  • submit_succ: indicates that 179 of the discovered assets were successfully submitted to the discovery run's internal asset processing pipeline
  • create_succ: indicates that 179 assets were successfully created in the project
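
The status responses shown above can be condensed into a short summary. The discovery_progress helper below is illustrative (not part of the API); it defaults missing statistics fields to 0 to match the empty statistics object returned early in a run:

```python
# Illustrative helper: condense a discovery status response into a
# short summary. Missing statistics fields default to 0, matching the
# empty statistics object returned just after a run starts.
def discovery_progress(response):
    stats = response["entity"].get("statistics", {})
    return {
        "status": response["entity"]["status"],
        "discovered": stats.get("discovered", 0),
        "created": stats.get("create_succ", 0),
    }
```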

At any time during or after a discovery run, you can call asset APIs to get the metadata for the assets discovered so far in the project. To retrieve metadata for any list of assets, make the following call:

POST /v2/asset_types/{type_name}/search?project_id={project_id}

More specifically, to find the metadata for discovered assets, use discovered_asset as the value of the {type_name} path parameter. So, for the discovery run created earlier, the call to retrieve metadata for the discovered assets looks like this:

API Request - Get metadata for discovered assets:

POST /v2/asset_types/discovered_asset/search?project_id=960f6aff-295f-4de1-a9d7-f3b6805b3590

where the project_id query parameter value 960f6aff-295f-4de1-a9d7-f3b6805b3590 is the same value that was specified in the body of the POST request that was used to create the discovery run.

In addition, the ID of the connection that the discovery was run against has to be specified in the body of the POST, like this:

{
    "query": "discovered_asset.connection_id:\"f638398f-fcc7-4856-b78d-5c8efa5b9282\""
}
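
Because the connection ID is embedded in a quoted query string, it can be convenient to build this search body programmatically. A minimal sketch (the helper name is illustrative):

```python
import json

# Hypothetical helper: build the discovered-asset search body shown
# above. The connection ID is wrapped in double quotes inside the query
# string, exactly as the example payload does.
def discovered_asset_query(connection_id):
    return json.dumps({
        "query": f'discovered_asset.connection_id:"{connection_id}"'
    })
```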

Here is part of the response body for the previous query:

{
    "total_rows": 179,
    "results": [
        {
            "metadata": {
                "name": "EMP_SURVEY_TOPIC_DIM",
                "description": "Warehouse table EMP_SURVEY_TOPIC_DIM describes employee survey questions for employees of the Great Outdoors Company, in supported languages.",
                "tags": [
                    "discovered",
                    "GOSALESDW"
                ],
                "asset_type": "data_asset",
                "origin_country": "ca",
                "rating": 0.0,
                "total_ratings": 0,
                "sandbox_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590",
                "catalog_id": "a682c698-6019-437d-a0b9-224aa0a4dbc9",
                "created": 0,
                "created_at": "2018-06-22T15:41:47Z",
                "owner": "abc123@us.ibm.com",
                "owner_id": "IBMid-50S...",
                "size": 0,
                "version": 0.0,
                "usage": {
                    "last_update_time": 1.52968210955E12,
                    "last_updater_id": "iam-ServiceId-87f49...",
                    "access_count": 0.0,
                    "last_accessor_id": "iam-ServiceId-87f49...",
                    "last_access_time": 1.52968210955E12,
                    "last_updater": "ServiceId-87f49...",
                    "last_accessor": "ServiceId-87f49..."
                },
                "asset_state": "available",
                "asset_attributes": [
                    "data_asset",
                    "discovered_asset"
                ],
                "rov": {
                    "mode": 0
                },
                "asset_category": "USER",
                "asset_id": "e35cfd4d-590f-40a5-b75c-ec07c0a4bcbc"
            }
        },
        .
        .
        .
}

Notice that the total_rows value 179 matches the create_succ value that was returned in the result of the API call to get the final status of the completed discovery run.

The results array in the previous response body has an entry containing metadata for each asset that was discovered by the discovery run. In the previous code snippet, for brevity, only the first of the 179 entries is shown. The metadata created by the discovery run includes:

  • name: in this case, the name of the DB2 table that was discovered
  • description: a description of the table as provided by DB2
  • tags: these are useful for searching. The discovered tag is one of the tags set for a discovered asset.
  • asset_type: the type of the asset that was created in the project

Each entry in the results array also contains an href field that points to the actual asset that was created by the discovery run.

There might be times when you no longer have the ID of the metadata discovery run whose status you're interested in, so you can't call the following API for that specific discovery run (it requires the ID):

GET /v2/data_discoveries/dcb8a234ad5e438d904a4cdbe0ba70e2

The following example illustrates how to get the IDs of metadata discovery runs for the connection and catalog that were used in the previous call to create a discovery run:

API Request - Get information for discovery runs:

GET /v2/data_discoveries?offset=0&limit=1000&connection_id=f638398f-fcc7-4856-b78d-5c8efa5b9282&catalog_id=816882fa-dcda-46e1-8c6b-fa23c3cbad14

Note that the values of the query parameters connection_id and catalog_id correspond to the values for the identically named fields in the payload for the previous request to create a discovery run.

Notice also that you can use the offset and limit query parameters to focus on a particular subset of the full list of related discoveries.
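
To walk all pages rather than a single subset, you can follow the first/next href links that the listing response provides. A sketch, assuming a caller-supplied fetch_page function that performs the GET for a given URL:

```python
def iter_discoveries(fetch_page, first_url):
    """Yield every discovery entry from a paginated listing.

    fetch_page is a caller-supplied function that performs the GET for
    a given URL and returns the parsed response dict; pagination
    follows the next.href links that the listing response provides.
    """
    url = first_url
    while url:
        page = fetch_page(url)
        resources = page.get("resources", [])
        for entry in resources:
            yield entry
        # An empty resources list means the listing is exhausted, even
        # if a next link is still present.
        if not resources:
            break
        url = page.get("next", {}).get("href")
```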

The response payload will look like this:

{
    "resources": [
        {
            "metadata": {
                "id": "dcb8a234ad5e438d904a4cdbe0ba70e2",
                "invoked_by": "IBMid-50S...",
                "bss_account_id": "e348e...",
                "created_at": "2018-06-22T15:42:02.843Z",
                "started_at": "2018-06-22T15:42:06.167Z",
                "discovered_at": "2018-06-22T15:42:27.970Z",
                "processed_at": "2018-06-22T15:42:45.877Z",
                "finished_at": "2018-06-22T15:43:14.969Z",
                "ref_project_connection_id": "2526ed95-dedd-4904-bb31-c06d9cb1e105"
            },
            "entity": {
                "statistics": {
                    "discovered": 179,
                    "submit_succ": 179,
                    "create_succ": 179
                },
                "status": "COMPLETED",
                "connection_id": "f638398f-fcc7-4856-b78d-5c8efa5b9282",
                "catalog_id": "816882fa-dcda-46e1-8c6b-fa23c3cbad14",
                "project_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590"
            }
        }
    ],
    "first": {
        "href": "http://localhost:9080/v2/data_discoveries?offset=0&limit=1000&connection_id=f638398f-fcc7-4856-b78d-5c8efa5b9282&catalog_id=816882fa-dcda-46e1-8c6b-fa23c3cbad14"
    },
    "next": {
        "href": "http://localhost:9080/v2/data_discoveries?offset=1000&limit=1000&connection_id=f638398f-fcc7-4856-b78d-5c8efa5b9282&catalog_id=816882fa-dcda-46e1-8c6b-fa23c3cbad14"
    },
    "limit": 1000,
    "offset": 0
}

Every discovery run that matches the query criteria is returned in the resources array. In the previous response, there is only one entry, and it corresponds to the discovery run that was created in the previous Create a metadata discovery run section.

There might be times when you want to stop a discovery run before it's completed. To do so, use the PATCH data_discoveries API. The following illustrates how to abort a discovery run (a different discovery run than the one used in the previous examples):

API Request - Abort a discovery run:

PATCH /v2/data_discoveries/09cbff0981f84c51be4b4d93becc17b0

The previous PATCH request requires the following request body to set the status of the discovery run to "ABORTED":

{
    "op": "replace",
    "path": "/entity/status",
    "value": "ABORTED"
}

The response payload will look like this:

{
    "metadata": {
        "id": "09cbff0981f84c51be4b4d93becc17b0",
        "invoked_by": "IBMid-50S...",
        "bss_account_id": "e348e...",
        "created_at": "2018-06-22T15:45:54.638Z",
        "started_at": "2018-06-22T15:45:56.202Z",
        "finished_at": "2018-06-22T15:46:02.274Z",
        "ref_project_connection_id": "2526ed95-dedd-4904-bb31-c06d9cb1e105"
    },
    "entity": {
        "statistics": {

        },
        "status": "ABORTED",
        "connection_id": "f638398f-fcc7-4856-b78d-5c8efa5b9282",
        "catalog_id": "816882fa-dcda-46e1-8c6b-fa23c3cbad14",
        "project_id": "960f6aff-295f-4de1-a9d7-f3b6805b3590"
    }
}

Notice in the previous response payload that the status has now been set to ABORTED.

Any assets discovered before the run was aborted will remain discovered. In the example, the abort occurred so quickly after the creation of the discovery run that no assets had been discovered, hence the statistics object is empty.

Data Samples

Introduction

A data sample is a representative subset of a data set that you can work with before processing the entire data set. Creating a data sample enables you to test and refine data cleansing and shaping operations on a smaller portion of the data. Working on a subset helps you determine the quality and appropriateness of your transformations for the type of data analysis you plan before you run those operations on the entire data set.

Create a data sample

To create a data sample for a data set in a specified catalog or project, call the following POST method:

POST /v2/data_samples

The JSON request payload for a data set in a catalog must be structured in the following way:

    {
         "dataset_id": "{DATASET_ID}",
         "catalog_id": "{CATALOG_ID}",
         "algorithm": {
              "type": "{TYPE_OF_ALGORITHM}",
              "seed": {SEED_VALUE},
              "fraction": {FRACTION_VALUE}
         }
    }

The JSON request payload for a data set in a project must be structured in the following way:

    {
         "dataset_id": "{DATASET_ID}",
         "project_id": "{PROJECT_ID}",
         "algorithm": {
              "type": "{TYPE_OF_ALGORITHM}",
              "seed": {SEED_VALUE},
              "max_rows": {MAX_ROWS_VALUE}
         }
    }

You can use either fraction or max_rows in the algorithm field of the payload to limit the size of the sample you want to create. These fields are optional, as is the type field, which specifies the sampling algorithm. If type is not specified, the RANDOM algorithm is used by default; currently, RANDOM is the only supported algorithm.
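
As a sketch, the two payload shapes above can be produced by one hypothetical helper. It treats fraction and max_rows as alternative size limits, as described above, and requires exactly one container (catalog or project):

```python
import json

# Hypothetical helper: assemble a POST /v2/data_samples payload.
def sample_request(dataset_id, *, catalog_id=None, project_id=None,
                   seed=None, fraction=None, max_rows=None):
    if (catalog_id is None) == (project_id is None):
        raise ValueError("specify exactly one of catalog_id or project_id")
    if fraction is not None and max_rows is not None:
        raise ValueError("use either fraction or max_rows, not both")
    algorithm = {"type": "RANDOM"}  # currently the only supported algorithm
    if seed is not None:
        algorithm["seed"] = seed
    if fraction is not None:
        algorithm["fraction"] = fraction
    if max_rows is not None:
        algorithm["max_rows"] = max_rows
    payload = {"dataset_id": dataset_id, "algorithm": algorithm}
    if catalog_id is not None:
        payload["catalog_id"] = catalog_id
    else:
        payload["project_id"] = project_id
    return json.dumps(payload)
```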

To create a data sample for a data set, perform the following steps:

  1. Obtain a valid IAM token for making REST API calls, and the ID of your project or catalog.

  2. Ensure that an IBM Cloud Object Storage bucket is associated with your catalog in the project.

  3. Add the data set to your catalog in the project.

  4. Construct a request payload for creating a data sample with the required values.

  5. Send a POST request to create the sample.

When you call the method, the payload is validated. If a required value is not specified or a value is invalid, you get a response message with an HTTP status code of 400 and information about the invalid or missing values.

The response of the method includes a Location header whose value indicates the location of the sample that was created. The response body also includes an entity.href field that contains the location of the created sample.

The following example shows a success response:

    {
        "metadata": {
            "asset_id": "93d3b425-9569-4f70-a53f-9192814769bd",
            "dataset_id": "a0572944-86a6-49ee-9da3-cb45d73e8d8a",
            "catalog_id": "3239e296-5aba-4256-aa9a-bfcf7b974e23",
            "owner": "ibmbluemix001@gmail.com",
            "created_at": "2017-09-09T16:04:33.238Z"
        },
        "entity": {
            "data_sample": {
                "catalog_id": "3239e296-5aba-4256-aa9a-bfcf7b974e23",
                "dataset_id": "a0572944-86a6-49ee-9da3-cb45d73e8d8a",
                "algorithm": {
                    "type": "RANDOM",
                    "with_replacement": false,
                    "seed": 1,
                    "max_rows": 10000
                },
                "href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/data_samples/93d3b425-9569-4f70-a53f-9192814769bd?dataset_id=a0572944-86a6-49ee-9da3-cb45d73e8d8a&catalog_id=3239e296-5aba-4256-aa9a-bfcf7b974e23",
                "sample_execution": {
                    "status": "initiated"
                }
            }
        }
    }

List all data samples

To list all data samples for a data set in a specified catalog or project, call the following GET method:

GET v2/data_samples?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}

OR

GET v2/data_samples?project_id={PROJECT_ID}&dataset_id={DATASET_ID}

The following example shows a success response:

    {
        "resources": [
            {
                "metadata": {
                    "asset_id": "93d3b425-9569-4f70-a53f-9192814769bd",
                    "dataset_id": "a0572944-86a6-49ee-9da3-cb45d73e8d8a",
                    "catalog_id": "3239e296-5aba-4256-aa9a-bfcf7b974e23",
                    "owner": "ibmbluemix001@gmail.com",
                    "created_at": "2017-09-09T16:04:33.238Z"
                },
                "entity": {
                    "data_sample": {
                        "algorithm": {
                            "type": "RANDOM",
                            "with_replacement": false,
                            "seed": 1,
                            "max_rows": 10000
                        },
                        "sample_execution": {
                            "activity_id": "828b28dd-8cc3-4a29-b42c-9dbdc737aa98",
                            "activity_run_id": "0f54a748-2a73-43f6-b8b5-1da9a2b160ba",
                            "status": "finished"
                        }
                    }
                }
            }
        ]
    }

Get the data sample for a data set

To get a data sample for a data set in a specified catalog or project, call the following GET method:

GET /v2/data_samples/{SAMPLE_ID}?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}

OR

GET /v2/data_samples/{SAMPLE_ID}?project_id={PROJECT_ID}&dataset_id={DATASET_ID}

The value of SAMPLE_ID is the value of metadata.asset_id from the successful response payload of the create sample call.

The following example shows a success response:

    {
        "metadata": {
            "asset_id": "93d3b425-9569-4f70-a53f-9192814769bd",
            "dataset_id": "a0572944-86a6-49ee-9da3-cb45d73e8d8a",
            "catalog_id": "3239e296-5aba-4256-aa9a-bfcf7b974e23",
            "owner": "ibmbluemix001@gmail.com",
            "created_at": "2017-09-09T16:04:33.238Z"
        },
        "entity": {
            "data_sample": {
                "algorithm": {
                    "type": "RANDOM",
                    "with_replacement": false,
                    "seed": 1,
                    "max_rows": 10000
                },
                "sample_execution": {
                    "activity_id": "828b28dd-8cc3-4a29-b42c-9dbdc737aa98",
                    "activity_run_id": "0f54a748-2a73-43f6-b8b5-1da9a2b160ba",
                    "status": "finished"
                }
            }
        }
    }

Get the data in a data sample

To get the data in a data sample in a specified catalog or project, call the following GET method:

GET /v2/data_samples/{SAMPLE_ID}/data?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}&_limit={LIMIT}&_offset={OFFSET}

OR

GET /v2/data_samples/{SAMPLE_ID}/data?project_id={PROJECT_ID}&dataset_id={DATASET_ID}&_limit={LIMIT}&_offset={OFFSET}

The value of SAMPLE_ID is the value of metadata.asset_id from the successful response payload of the create sample call.

Update a data sample

To update a data sample for a data set in a specified catalog or project, call the following PATCH method:

PATCH /v2/data_samples/{SAMPLE_ID}?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}

OR

PATCH /v2/data_samples/{SAMPLE_ID}?project_id={PROJECT_ID}&dataset_id={DATASET_ID}

The value of SAMPLE_ID is the value of metadata.asset_id from the successful response payload of the create sample call.

The JSON request payload to change the seed value must be structured in the following way:

    [{
        "op":"replace",
        "path":"/entity/data_sample/algorithm/seed",
        "value":10
    }]

This API does not allow you to update the data sample metadata, for example, the creation time, the modification time, or the creator of the sample. You also cannot modify data sample container details such as the attachment URL.

However, you can modify the algorithm parameters of the sample by specifying a new seed value, or the with_replacement and fraction attributes.

If the sampling process is still running, the data sample is not updated unless the stop_in_progress_runs parameter is set to true. To start the sampling process again as soon as the sample is updated, set the start parameter to true.

The data updates must be specified by using the JSON patch format, described in RFC 6902.
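
As an illustration of the RFC 6902 format, the following minimal sketch applies a replace operation, such as the seed update above, to a response-shaped document. A real client would normally use a full JSON Patch library instead:

```python
# Minimal sketch of an RFC 6902 "replace" applier, enough to mirror the
# seed update shown above.
def apply_replace(doc, patch_ops):
    for op in patch_ops:
        if op["op"] != "replace":
            raise NotImplementedError(op["op"])
        # Split the path into parent keys and the leaf key to set.
        *parents, leaf = [p for p in op["path"].split("/") if p]
        target = doc
        for key in parents:
            target = target[key]
        target[leaf] = op["value"]
    return doc
```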

The following example shows a success response:

    {
        "metadata": {
            "asset_id": "93d3b425-9569-4f70-a53f-9192814769bd",
            "dataset_id": "a0572944-86a6-49ee-9da3-cb45d73e8d8a",
            "catalog_id": "3239e296-5aba-4256-aa9a-bfcf7b974e23",
            "owner": "ibmbluemix001@gmail.com",
            "created_at": "2017-09-09T16:04:33.238Z"
        },
        "entity": {
            "data_sample": {
                "algorithm": {
                    "type": "RANDOM",
                    "with_replacement": false,
                    "seed": 10,
                    "max_rows": 10000
                },
                "sample_execution": {
                    "activity_id": "828b28dd-8cc3-4a29-b42c-9dbdc737aa98",
                    "activity_run_id": "0f54a748-2a73-43f6-b8b5-1da9a2b160ba",
                    "status": "finished"
                }
            }
        }
    }

Delete a data sample

To delete a data sample for a data set in a specified catalog or project, call the following DELETE method:

DELETE /v2/data_samples/{SAMPLE_ID}?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}

OR

DELETE /v2/data_samples/{SAMPLE_ID}?project_id={PROJECT_ID}&dataset_id={DATASET_ID}

The value of SAMPLE_ID is the value of metadata.asset_id from the successful response payload of the create sample call.

A successful response is received with an HTTP status code of 204.

Delete all data samples

To delete all data samples for a data set in a specified catalog or project, call the following DELETE method:

DELETE /v2/data_samples?catalog_id={CATALOG_ID}&dataset_id={DATASET_ID}

OR

DELETE /v2/data_samples?project_id={PROJECT_ID}&dataset_id={DATASET_ID}

A successful response is received with an HTTP status code of 204.
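The two DELETE endpoints differ only in their scope and in whether a SAMPLE_ID path segment is present. A small sketch that builds either URL (the `sample_delete_url` helper is hypothetical, for illustration only):

```python
from urllib.parse import urlencode

BASE = "https://api.dataplatform.cloud.ibm.com"

def sample_delete_url(dataset_id, sample_id=None, *, catalog_id=None, project_id=None):
    """Return the DELETE URL for one data sample (sample_id given) or for
    all samples of a data set (sample_id omitted). The sample must be scoped
    to exactly one of catalog_id or project_id."""
    if (catalog_id is None) == (project_id is None):
        raise ValueError("specify exactly one of catalog_id or project_id")
    scope = {"catalog_id": catalog_id} if catalog_id else {"project_id": project_id}
    path = "/v2/data_samples" + (f"/{sample_id}" if sample_id else "")
    return f"{BASE}{path}?" + urlencode({**scope, "dataset_id": dataset_id})
```

For example, `sample_delete_url("a0572944", project_id="ffbe7f01")` targets all samples of the data set in that project.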

Troubleshooting your way out if something goes wrong

In case of failure of any of the API endpoints, if you cannot pinpoint from the error message what went wrong (most commonly with an Internal Server Error 500 HTTP status code), you can retrieve the activity run logs and examine all of the steps that ran behind the scenes.

Possible scenarios are that the activity did not complete as expected, finished with errors, or was aborted. A common culprit is that the activity cannot connect to sources or targets with the connection information specified in the request payload. From a sampling perspective, this means that the connection was either not created for the catalog or project, or the attachment for the data set has inconsistent interaction properties (in the case of a remote attachment).

To get the activity run logs, call the following GET method:

GET /v2/activities/{ACTIVITY_ID}/activityruns/{ACTIVITY_RUN_ID}/logs?catalog_id={CATALOG_ID}

OR

GET /v2/activities/{ACTIVITY_ID}/activityruns/{ACTIVITY_RUN_ID}/logs?project_id={PROJECT_ID}

The values of ACTIVITY_ID and ACTIVITY_RUN_ID are present in the response payload of the GET sample call at the paths entity.data_sample.sample_execution.activity_id and entity.data_sample.sample_execution.activity_run_id, respectively.

The response to the GET method includes information about each log event, including the event time, message type, and message text.

A maximum of 100 logs is returned per page. To specify a lower limit, use the _limit query parameter with an integer value. More logs than those on the first page might be available. To get the next page, call a GET method using the value of the next.href member from the response payload. This URI includes the _start query parameter which contains the next page bookmark token.
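The paging loop can be sketched as follows; `fetch_page` stands in for an authenticated GET against the logs endpoint, and the "results" field name for the per-page event list is an assumption for illustration:

```python
def collect_logs(first_url, fetch_page):
    """Follow next.href links until the last page, gathering all log events."""
    events, url = [], first_url
    while url:
        page = fetch_page(url)                  # JSON-decoded response body
        events.extend(page.get("results", []))  # per-page log events
        url = page.get("next", {}).get("href")  # absent on the last page
    return events

# Fake two-page response for illustration:
pages = {
    "/logs?_limit=100": {"results": [{"message_text": "step 1"}],
                         "next": {"href": "/logs?_limit=100&_start=tok"}},
    "/logs?_limit=100&_start=tok": {"results": [{"message_text": "step 2"}]},
}
all_events = collect_logs("/logs?_limit=100", pages.__getitem__)
```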

Lineage

Introduction

The lineage of an asset includes information about all events, and other assets, that have led to its current state and its further usage. Asset and Event are the two main entities that are part of the lineage data model. An asset can either be generated from or used in subsequent events. An event can be any of:

  • asset-generation-events
  • asset-modification-events
  • asset-usage-events.

Use the Lineage API to publish events on an asset or to query the lineage of an asset.

Publish a lineage event

The following example shows a sample lineage event that can be posted when a data set is published from a project to a catalog:

Request URL

POST /v2/lineage_events

Request Body
{
  "message_version": "v1",

  "user_id": "IAM-Id_of_User",
  "account_id": "e86f2b06b0b267d559e7c387ceefb089",

  "event_details": {
    "event_id": "sample-event1",
    "event_type": "DATASET_PUBLISHED",
    "event_category": [
      "additions"
    ],
    "event_time": "2018-04-03T14:01:08.603Z",
    "event_source_service": "Watson Knowledge Catalog"
  },

  "generates_assets": [
    {
      "id": "9f9c961a-78d1-4c06-a601-4b5890fdataset03",
      "asset_type": "DataSet",
      "relation": {
        "name": "Created"
      },

      "properties": {
        "dataset": {
          "type": "dataset",
          "value": {
            "id": "9f9c961a-78d1-4c06-a601-4b5890fdataset03",
            "name": "Asset Name in Catalog XX",
            "catalog_id": "9f9c961a-78d1-4c06-a601-4b589catalog"
          }
        },
        "catalog": {
          "type": "catalog",
          "value": {
            "id": "9f9c961a-78d1-4c06-a601-4b589catalog"
          }
        }
      }
    }
  ],
  "uses_assets": [
    {
      "id": "9f9c961a-78d1-4c06-a601-4b5890fdataset02",
      "asset_type": "DataSet",
      "relation": {
        "name": "Used"
      },

      "properties": {
        "dataset": {
          "type": "dataset",
          "value": {
            "id": "9f9c961a-78d1-4c06-a601-4b5890fdataset02",
            "name": "2017_sales_data",
            "project_id": "9f9c961a-78d1-4c06-a601-4b589project"
          }
        },
        "project": {
          "type": "project",
          "value": {
            "id": "9f9c961a-78d1-4c06-a601-4b589project"
          }
        }
      }
    }
  ]
}

Response Body

{
  "metadata": {
    "id": "01014d1f-31cf-4956-bd41-7a77ba14004c",
    "source_event_id": "sample-event1"
  }
}

The id generated in the response can be used to query the details of the published event with the following request:

Request URL

GET /v2/lineage_events/01014d1f-31cf-4956-bd41-7a77ba14004c

For more details on each field in the lineage event JSON payload, refer to the Lineage Events section of API documentation.
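A minimal sketch of assembling an event body like the one shown above (the `lineage_event` helper is illustrative, not part of the API; the field values come from the sample request):

```python
def lineage_event(event_id, event_type, user_id, account_id,
                  generates_assets, uses_assets, event_time,
                  event_source_service="Watson Knowledge Catalog"):
    """Assemble the POST /v2/lineage_events body from its parts."""
    return {
        "message_version": "v1",
        "user_id": user_id,
        "account_id": account_id,
        "event_details": {
            "event_id": event_id,
            "event_type": event_type,
            # The sample event above classifies itself under "additions".
            "event_category": ["additions"],
            "event_time": event_time,
            "event_source_service": event_source_service,
        },
        "generates_assets": generates_assets,
        "uses_assets": uses_assets,
    }

event_body = lineage_event("sample-event1", "DATASET_PUBLISHED",
                           "IAM-Id_of_User", "e86f2b06b0b267d559e7c387ceefb089",
                           generates_assets=[], uses_assets=[],
                           event_time="2018-04-03T14:01:08.603Z")
```

The generates_assets and uses_assets lists take asset objects shaped like those in the sample request body.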

Query lineage of an asset

The lineage of an asset involved in the sample event can be queried using the following request:

Request URL

GET /v2/asset_lineages/9f9c961a-78d1-4c06-a601-4b5890fdataset03

Response Body

{
  "resources": [
    {
      "metadata": {
        "id": "01014d1f-31cf-4956-bd41-7a77ba14004c",
        "source_event_id": "sample-event1",
        "created_at": "2018-04-03T14:01:08.603Z",
        "created_by": "IAM-Id_of_User"
      },
      "entity": {
        "type": "DATASET_PUBLISHED",
        "generates_assets": [
          {
            "id": "9f9c961a-78d1-4c06-a601-4b5890fdataset03",
            "type": "DataSet",
            "relation": {
              "name": "Created"
            },
            "properties": {
              "catalog": {
                "type": "catalog",
                "value": {
                  "id": "9f9c961a-78d1-4c06-a601-4b589catalog"
                }
              },
              "dataset": {
                "type": "dataset",
                "value": {
                  "id": "9f9c961a-78d1-4c06-a601-4b5890fdataset03",
                  "name": "Asset Name in Catalog XX",
                  "catalog_id": "9f9c961a-78d1-4c06-a601-4b589catalog"
                }
              }
            }
          }
        ],
        "uses_assets": [
          {
            "id": "9f9c961a-78d1-4c06-a601-4b5890fdataset02",
            "type": "DataSet",
            "relation": {
              "name": "Used"
            },
            "properties": {
              "dataset": {
                "type": "dataset",
                "value": {
                  "id": "9f9c961a-78d1-4c06-a601-4b5890fdataset02",
                  "name": "2017_sales_data",
                  "project_id": "9f9c961a-78d1-4c06-a601-4b589project"
                }
              },
              "project": {
                "type": "project",
                "value": {
                  "id": "9f9c961a-78d1-4c06-a601-4b589project"
                }
              }
            }
          }
        ],
        "properties": {
          "event_time": "2018-04-03T14:01:08.603Z",
          "event_category": [
            "additions"
          ],
          "event_source_service": "Watson Knowledge Catalog"
        }
      }
    }
  ],
  "limit": 50,
  "offset": 0,
  "first": {
    "href": "https://api.dataplatform.cloud.ibm.com/v2/asset_lineages/9f9c961a-78d1-4c06-a601-4b5890fdataset03?offset=0&_=1528182675331"
  }
}

Methods

Get asset lineage

Returns the lineage of the asset identified by asset_id. The lineage includes a list of events associated with the asset. A different list of events is returned based on the value of the lineage_type query parameter.

  • lineage_type = default | partial - The event generating the asset, the events happening on the asset and the events using the asset to generate other assets.
  • lineage_type = forward - The event generating the asset and the events using the asset directly or indirectly to generate other assets.
  • lineage_type = forward_hop - The event generating the asset and the events using the asset directly or indirectly to generate other assets. The number of events returned is controlled by hop_count and event_count parameters.
  • lineage_type = backward - All the direct and indirect events responsible for generating the asset.
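A sketch of building the query URL for each lineage_type (the `lineage_url` helper is hypothetical, for illustration only):

```python
from urllib.parse import urlencode

BASE = "https://api.dataplatform.cloud.ibm.com"

def lineage_url(asset_id, lineage_type="partial", **params):
    """Build GET /v2/asset_lineages/{asset_id} with the lineage_type and any
    further query parameters (hop_count, event_count, catalog_id, ...)."""
    query = {"lineage_type": lineage_type, **params}
    return f"{BASE}/v2/asset_lineages/{asset_id}?" + urlencode(query)

# forward_hop honours the hop_count and event_count parameters:
url = lineage_url("9f9c961a-78d1-4c06-a601-4b5890fdataset03", "forward_hop",
                  hop_count=4, event_count=10)
```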
GET /v2/asset_lineages/{asset_id}
Request

Custom Headers

  • IBM IAM access token

Path Parameters

  • Asset ID

Query Parameters

  • The type of the asset. The value can be one of dataset, model. If the seed asset ID is supplied, then the value supplied should be the type of the seed asset.

  • The type of lineage to be returned. The default value is partial

    Allowable values: [forward,forward_hop,backward,partial]

  • The type of event classification. By default, the events are not classified.

    Allowable values: [timelines]

  • If the asset is a catalog-asset, specify the ID of the catalog. If the seed asset ID is supplied, then the value supplied should be the catalog ID of the seed asset.

  • If the asset is a project-asset, specify the ID of the project. If the seed asset ID is supplied, then the value supplied should be the project ID of the seed asset.

  • If the asset is a model, specify the ID of the associated WML service instance. If the seed asset ID is supplied, then the value supplied should be the WML service instance ID of the seed asset.

  • The time after which the events have to be fetched. The format should be yyyy-MM-dd'T'HH:mm:ss.SSSX. The default value is 1970-01-01T00:00:00.000Z.

  • Used to control the number of events returned when lineage_type is forward_hop. The events returned will belong to a maximum of N other assets directly or indirectly generated using the asset where N is equal to hop_count. The default value is 4.

  • Used to control the number of events returned when lineage_type is forward_hop. The maximum number of events of each other asset generated directly or indirectly using the asset. The default value is 10.

  • The seed asset ID. The request will not be authorized if the lineage of the asset is not part of the lineage of the seed asset.

  • The maximum number of events returned. The default value is 50.

  • The index of the first matching event to be included in the result. The default value is 0.

Response

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Not Found

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Get asset lineage summary

Returns lineage summary of the asset identified by asset_id.

GET /v2/asset_lineages/{asset_id}/summary
Request

Custom Headers

  • IBM IAM access token

Path Parameters

  • Asset ID

Query Parameters

  • The type of the asset. The value can be one of dataset, model.

  • If the asset is a catalog-asset, specify the ID of the catalog.

  • If the asset is a project-asset, specify the ID of the project.

  • If the asset is a model, specify the ID of the associated WML service instance.

Response

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Not Found

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Publish lineage event

Publishes lineage event.

POST /v2/lineage_events
Request

Custom Headers

  • IBM IAM access token

Event message json

Response

Status Code

  • successful operation

  • Accepted

  • Bad Request

  • Unauthorized

  • Forbidden

  • Not Found

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Get lineage event

Returns the event identified by event_id.

GET /v2/lineage_events/{event_id}
Request

Custom Headers

  • IBM IAM access token

Path Parameters

  • Event ID

Response

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Not Found

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Create project as a transaction

Creates a new project with the provided parameters, including all the storage and credentials, in a single transaction. This endpoint creates a new COS bucket with a generated unique name, creates all credentials and the asset container, and calls all the required atomic APIs to fully configure the new project. Attempting to use a duplicate project name results in an error. NOTE: when creating projects programmatically, always use this endpoint, not /v2/projects.

POST /transactional/v2/projects
Request

Custom Headers

  • Authorization value should use the format Bearer [token] - where [token] is an opaque access token obtained by authenticating with IAM authentication services.

  • Authorization value should use the format Bearer [token] - where [token] is an opaque access token obtained by authenticating with IBM Cloud UAA authentication services.

  • Should be application/json

A project object
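A hedged sketch of the request body: the exact schema of the project object is defined in the API reference, and the fields below are assumptions for illustration only.

```python
import json

# Hypothetical minimal body for POST /transactional/v2/projects.
# The "name" and "description" fields are illustrative assumptions;
# consult the project object schema for the full set of properties.
project = {
    "name": "sales-analysis",  # must be unique; duplicate names are rejected
    "description": "2017 sales data exploration",
}
request_body = json.dumps(project)
```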

Response

Status Code

  • Created

  • Bad Request. Returned when the request parameters/body are invalid.

  • Unauthorized. Returned when a user makes a request without a proper Authorization header.

No Sample Response

This method does not specify any sample responses.

Delete project as a transaction

Deletes the project with the given guid, together with the COS bucket and all the files in it, all credentials, and the asset container, in the reverse order from the project creation transaction. When deleting projects programmatically, always use this endpoint, not /v2/projects/{guid}.

DELETE /transactional/v2/projects/{guid}
Request

Custom Headers

  • Authorization value should use the format Bearer [token] - where [token] is an opaque access token obtained by authenticating with IAM authentication services.

    Default: Bearer token

  • Authorization value should use the format Bearer [token] - where [token] is an opaque access token obtained by authenticating with IBM Cloud UAA authentication services.

    Default: Bearer token

Path Parameters

  • The GUID for the project to be deleted.

Response

Status Code

  • No Content

  • Bad Request. Returned when the request parameters/body are invalid.

  • Unauthorized. Returned when a user makes a request without a proper Authorization header.

  • Forbidden. Returned when a user makes a request to something they are not entitled to.

  • Not Found. Object does not exist.

No Sample Response

This method does not specify any sample responses.

Get all projects

Returns a list of projects that meet the provided query parameters. By default, the list returns projects that the authenticated user is a member of.

GET /v2/projects
Request

Custom Headers

  • Authorization value should use the format Bearer [token] - where [token] is an opaque access token obtained by authenticating with either IAM ID or IBM BlueID or Bluemix UAA authentication services.

    Default: Bearer token

Query Parameters

  • This parameter should be used when text search is used (the match parameter is regexp). After the first set of results is returned, 'bookmark' is one of the properties in the response. Pass this value in subsequent requests as the 'bookmark' query parameter.

  • Filters the result list to only include projects whose guid matches those in the list.

  • Instructs the API to return the specified content according to the comma-separated list of sections.

    Values:
    • fields (default)
    • members
    • assets
    • everything
    • nothing

  • Limits the number of results returned when more than one project is expected. Valid values are 0 <= limit <= 100; the default is 10.

  • Filters the result list to only include projects whose name matches this parameter. The value of the parameter is treated in accordance with the value of the match parameter.

  • Used in conjunction with the name parameter. Accepted values are "exact" and "regexp".

    Values:
    • exact (default)
    • regexp

  • Filters the result list to only include projects where the user with a matching user id is a member.

  • Must be used in conjunction with the member query parameter. Filters the result set to include only projects where the specified member has one of the roles specified.

    Values:
    • admin
    • editor
    • viewer

  • Initial offset used in conjunction with 'limit' for pagination.

  • Used to achieve higher query performance at the cost of potentially not retrieving the latest data. If true, use cached query results. Default: false.

  • Filters results based on the client-defined tags associated with projects.
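As a sketch, the query parameters above can be combined into a request URL like this (`list_projects_url` is an illustrative helper, not part of any SDK):

```python
from urllib.parse import urlencode

BASE = "https://api.dataplatform.cloud.ibm.com"

def list_projects_url(**params):
    """Build GET /v2/projects with any of the query parameters above
    (name, match, member, roles, include, limit, offset, ...)."""
    return f"{BASE}/v2/projects?" + urlencode(params)

# Projects named exactly "sales" where user jdoe is an admin, 25 per page:
url = list_projects_url(name="sales", match="exact", member="jdoe",
                        roles="admin", limit=25, offset=0)
```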

Response

Status Code

  • OK

  • No Content. Returned from GET when it succeeds but the 'include' property is set to 'nothing' or no projects exist.

  • Unauthorized. Returned when a user makes a request without a proper Authorization header.

  • Forbidden. Returned when a user makes a request to something they are not entitled to.

No Sample Response

This method does not specify any sample responses.

Get project

Returns data for a single project identified by the guid.

GET /v2/projects/{guid}
Request

Custom Headers

  • Authorization value should use the format Bearer [token] - where [token] is an opaque access token obtained by authenticating with either IAM ID or IBM BlueID or Bluemix UAA authentication services.

    Default: Bearer token

Path Parameters

Query Parameters

  • Instructs the API to return the specified content according to the comma-separated list of sections.

    Values:
    • fields (default)
    • members
    • assets
    • everything
    • nothing

Response

Status Code

  • OK

  • No Content. Returned from GET when it succeeds but there is no content to return (when 'include=nothing' is used).

  • Bad Request. Returned when the request parameters/body are invalid.

  • Unauthorized. Returned when a user makes a request without a proper Authorization header.

  • Forbidden. Returned when a user makes a request to something they are not entitled to.

  • Not Found. Object does not exist.

No Sample Response

This method does not specify any sample responses.

Update project

Partially updates the project with only a subset of properties.

PATCH /v2/projects/{guid}
Request

Custom Headers

  • Should be application/json

  • Authorization value should use the format Bearer [token] - where [token] is an opaque access token obtained by authenticating with either IAM ID or IBM BlueID or Bluemix UAA authentication services.

    Default: Bearer token

Path Parameters

  • The GUID of the project being updated.

Response

Status Code

  • OK

  • Bad Request. Returned when the request parameters/body are invalid.

  • Unauthorized. Returned when a user makes a request without a proper Authorization header.

  • Forbidden. Returned when a user makes a request to something they are not entitled to.

  • Not Found. Object does not exist.

No Sample Response

This method does not specify any sample responses.

Get members

Returns the list of project members.

GET /v2/projects/{guid}/members
Request

Custom Headers

  • Authorization value should use the format Bearer [token] - where [token] is an opaque access token obtained by authenticating with either IAM ID or IBM BlueID or Bluemix UAA authentication services.

    Default: Bearer token

Path Parameters

  • The GUID of a project.

Query Parameters

  • If provided, filters the list of members to only members that match one of the provided roles. The roles should be provided as a comma-separated list of legal role names

    Values:
    • viewer
    • editor
    • admin

    Allowable values: [viewer,editor,admin]

  • If provided, filters the list of members to only members that match one of the provided user_names.

Response

Status Code

  • OK

  • Bad Request. Returned when the request parameters/body are invalid.

  • Unauthorized. Returned when a user makes a request without a proper Authorization header.

  • Forbidden. Returned when a user makes a request to something they are not entitled to.

  • Not Found. Object does not exist.

No Sample Response

This method does not specify any sample responses.

Create members

Adds new project members with the provided roles. The members will be referenced by the guids of users in Blue ID. Note: there must always be at least one admin for each project.

POST /v2/projects/{guid}/members
Request

Custom Headers

  • Should be application/json

  • Authorization value should use the format Bearer [token] - where [token] is an opaque access token obtained by authenticating with either IAM ID or IBM BlueID or Bluemix UAA authentication services.

    Default: Bearer token

Path Parameters

  • The GUID for a project.

Response

Status Code

  • OK

  • Bad Request. Returned when the request parameters/body are invalid.

  • Unauthorized. Returned when a user makes a request without a proper Authorization header.

  • Forbidden. Returned when a user makes a request to something they are not entitled to.

  • Not Found. Object does not exist.

No Sample Response

This method does not specify any sample responses.

Update members

Change project member roles in a batch.

PATCH /v2/projects/{guid}/members
Request

Custom Headers

  • Should be application/json

  • Authorization value should use the format Bearer [token] - where [token] is an opaque access token obtained by authenticating with either IAM ID or IBM BlueID or Bluemix UAA authentication services.

    Default: Bearer token

Path Parameters

  • The GUID of a project.

Response

Status Code

  • OK

  • Bad Request. Returned when the request parameters/body are invalid.

  • Unauthorized. Returned when a user makes a request without a proper Authorization header.

  • Forbidden. Returned when a user makes a request to something they are not entitled to.

  • Not Found. Object does not exist.

No Sample Response

This method does not specify any sample responses.

Delete members

Deletes members from the project that match the provided usernames.

DELETE /v2/projects/{guid}/members
Request

Custom Headers

  • Authorization value should use the format Bearer [token] - where [token] is an opaque access token obtained by authenticating with either IAM ID or IBM BlueID or Bluemix UAA authentication services.

    Default: Bearer token

Path Parameters

  • The GUID of a project.

Query Parameters

  • A comma-separated list of user names of the project members that should be removed from the project.

Response

Status Code

  • No Content

  • Bad Request. Returned when the request parameters/body are invalid.

  • Unauthorized. Returned when a user makes a request without a proper Authorization header.

  • Forbidden. Returned when a user makes a request to something they are not entitled to.

  • Not Found. Object does not exist.

No Sample Response

This method does not specify any sample responses.

Get member

Returns the project member with the specified 'user_name' if any.

GET /v2/projects/{guid}/members/{user_name}
Request

Custom Headers

  • Authorization value should use the format Bearer [token] - where [token] is an opaque access token obtained by authenticating with either IAM ID or IBM BlueID or Bluemix UAA authentication services.

    Default: Bearer token

Path Parameters

  • The GUID of a project

  • A username for a project member.

Response

Status Code

  • OK

  • Bad Request. Returned when the request parameters/body are invalid.

  • Unauthorized. Returned when a user makes a request without a proper Authorization header.

  • Forbidden. Returned when a user makes a request to something they are not entitled to.

  • Not Found. Object does not exist.

No Sample Response

This method does not specify any sample responses.

Delete member

Deletes the member with the given user name from the project.

DELETE /v2/projects/{guid}/members/{user_name}
Request

Custom Headers

  • Authorization value should use the format Bearer [token] - where [token] is an opaque access token obtained by authenticating with either IAM ID or IBM BlueID or Bluemix UAA authentication services.

    Default: Bearer token

Path Parameters

  • The GUID of a project.

  • The username of the member to be deleted.

Response

Status Code

  • No Content

  • Bad Request. Returned when the request parameters/body are invalid.

  • Unauthorized. Returned when a user makes a request without a proper Authorization header.

  • Forbidden. Returned when a user makes a request to something they are not entitled to.

  • Not Found. Object does not exist.

No Sample Response

This method does not specify any sample responses.

list all governance types

Lists all governance types. A governance type groups like operations together. For example, the Access governance type groups operations related to access of assets.

GET /v2/governance_types
Request

Custom Headers

  • IBM IAM access token

Response

Response for the /v2/governance_types API

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

retrieve a governanceType

Retrieves detailed information on a governance type. This includes all of the operations defined for the governance type and the allowed and default outcomes of each operation.

GET /v2/governance_types/{governance_type_name}
Request

Custom Headers

  • IBM IAM access token

Path Parameters

  • the name of the governance type

Response

Response for the /v2/governance_types/{governanceTypeId} API

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Not Found

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

list all policies

Lists all defined policies. This includes draft and archived policies as well as active policies. When more than one filter criterion is specified, the resulting collection satisfies all the criteria.

GET /v2/policies
Request

Custom Headers

  • IBM IAM access token

Query Parameters

  • Specify the name of the policy to search for, or use a filter of the form 'contains:xx' to search for policies whose name contains the provided phrase, or a filter of the form 'exact:xx' to search for policies with that exact name.

  • Specify the description of the policy to search for, or use a filter of the form 'contains:xx' to search for policies whose description contains the provided phrase.

  • Specify the rule id to search for policies containing the rule.

  • Specify the category id to search for policies mapped to the category.

  • Specify the state to search for policies in that state.

    Allowable values: [draft,active,archived]

  • The order to sort the policies. The following values are allowed:

    • name, -name -- ascending or descending order by the name
    • modified_date, -modified_date -- ascending or descending order by modified date
    • state, -state -- ascending or descending order by policy state. Ascending order for state is draft, active, archived.

    Allowable values: [name,-name,modified_date,-modified_date,state,-state]

  • The maximum number of Policies to return. The default value is 50.

  • The index of the first matching Policy to include in the result. The default value is 0.
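A sketch of combining the filter, sort, and paging parameters above into a request URL (the `list_policies_url` helper is illustrative):

```python
from urllib.parse import urlencode

BASE = "https://api.dataplatform.cloud.ibm.com"

def list_policies_url(**params):
    """Build GET /v2/policies with the filter and paging parameters above."""
    return f"{BASE}/v2/policies?" + urlencode(params)

# Active policies whose name contains "gdpr", most recently modified first:
url = list_policies_url(name="contains:gdpr", state="active",
                        sort="-modified_date", limit=50, offset=0)
```

Note that urlencode percent-encodes the colon in the 'contains:gdpr' filter; the service decodes it on receipt.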

Response

Response for the /v2/policies API

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

create a policy

Creates a policy. The policy can include a list of existing rules to associate with it. The existing rules can be passed as an array of Rule objects or as an array of strings containing rule IDs. If state is not specified, the policy is created in draft state and must later be updated to active state to be evaluated and participate in policy enforcement.

Maximum length allowed for the 'name' parameter: 80 characters; maximum length allowed for the 'description' parameter: 1000 characters. Allowed characters for the 'name' parameter: letters from any language, numbers in any script, space, dot, underscore, hyphen. Strings with characters other than these are rejected (only for the name parameter).

POST /v2/policies
Request

Custom Headers

  • IBM IAM access token

policy json

Response

Response for the /v2/policies/{policyId} API

Status Code

  • successful operation

  • Created

  • Bad Request

  • Unauthorized

  • Forbidden

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

return counts of total and active policies and rules

Returns counts of the total and active policies, as well as counts of the rules in those policies. When a rule is used within multiple policies, it is counted only once.

GET /v2/policies/counts
Request

Custom Headers

  • IBM IAM access token

Response

Response for the /v2/policies/counts API

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

retrieve a policy

Retrieves detailed information on a policy given the policy's identifier. This includes a list of rules and subpolicies associated with this policy.

GET /v2/policies/{policy_id}
Request

Custom Headers

  • IBM IAM access token

Path Parameters

  • policy ID

Response

Response for the /v2/policies/{policyId} API

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Not Found

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

update a policy

Updates a policy. This API is also used to add and remove rules and subpolicies from the policy. The rules can be passed as an array of strings containing rule IDs. Only policies in draft state can be modified. Policies in active state can only be modified to change the state (from active to archived).

Maximum length allowed for the 'name' parameter: 80 characters; maximum length allowed for the 'description' parameter: 1000 characters. If the 'name' parameter is modified, the allowed characters are: letters from any language, numbers in any script, space, dot, underscore, hyphen. Strings with characters other than these are rejected (only for the name parameter).

PUT /v2/policies/{policy_id}
Request

Custom Headers

  • IBM IAM access token

Path Parameters

  • policy ID

policy json

Response

Response for the /v2/policies/{policyId} API

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Not Found

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

delete a policy

Deletes a policy.

DELETE /v2/policies/{policy_id}
Request

Custom Headers

  • IBM IAM access token

Path Parameters

  • policy ID

Response

Status Code

  • No Content (OK)

  • Bad Request

  • Unauthorized

  • Forbidden

  • Not Found

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

list all policy categories

Lists all policy categories. This includes categories that are children of other categories. When more than one filter criterion is specified, the resulting collection satisfies all the criteria.

GET /v2/policy_categories
Request

Custom Headers

  • IBM IAM access token

Query Parameters

  • Specify name of the category to search for or use filter of the form 'contains:xx' to search for categories containing provided phrase as part of name or use filter of the form 'exact:xx' to search for categories with exact name.

  • Specify description of the category to search for or use filter of the form 'contains:xx' to search for categories containing provided phrase as part of description.

  • Specify the policy id to search for categories containing the policy.

  • The order to sort the categories. The following values are allowed:

    • name, -name -- ascending or descending order by the name
    • modified_date, -modified_date -- ascending or descending order by modified date

    Allowable values: [name,-name,modified_date,-modified_date]

  • The maximum number of Categories to return. The default value is 50.

  • The index of the first matching Category to include in the result. The default value is 0.
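The filter, sort, and paging parameters above can be combined into a single query string. A sketch follows; note that the parameter names used here (name, description, sort, limit, offset) are assumptions, since the reference describes the parameters without naming them.

```python
import urllib.parse

def build_category_list_query(name_filter=None, description_filter=None,
                              sort="name", limit=50, offset=0):
    """Assemble the query string for GET /v2/policy_categories.

    The parameter names (name, description, sort, limit, offset) are
    assumptions based on the parameter descriptions in the reference.
    """
    if sort not in ("name", "-name", "modified_date", "-modified_date"):
        raise ValueError("unsupported sort value: %s" % sort)
    params = {"sort": sort, "limit": limit, "offset": offset}
    if name_filter is not None:
        # e.g. "contains:finance" or "exact:Finance"
        params["name"] = name_filter
    if description_filter is not None:
        params["description"] = description_filter
    return urllib.parse.urlencode(params)
```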

Response

Response for the /v2/policy_categories API

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

create a policy category

Creates a new policy category. If parent_category_id is specified in the request JSON, the category is created as a child of that category; pass parent_category_id only if a hierarchy needs to be established. Maximum length allowed for the 'name' parameter: 128 characters; maximum length allowed for the 'description' parameter: 1000 characters. Allowed characters for the 'name' parameter: letters from any language, numbers in any script, space, dot, underscore, and hyphen. Strings with characters other than these are rejected (only for the name parameter).

POST /v2/policy_categories
Request

Custom Headers

  • IBM IAM access token

category json

Response

Response for the /v2/policy_categories/{categoryId} API

Status Code

  • successful operation

  • Created

  • Bad Request

  • Unauthorized

  • Forbidden

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

retrieve a policy category

Retrieves details of a policy category given the identifier for the category.

GET /v2/policy_categories/{category_id}
Request

Custom Headers

  • IBM IAM access token

Path Parameters

  • category ID

Response

Response for the /v2/policy_categories/{categoryId} API

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Not Found

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

update a policy category

Updates information on a policy category. This API can also be used to add and remove policies from a category. Updating parent_category_id is not allowed. Maximum length allowed for the 'name' parameter: 128 characters; maximum length allowed for the 'description' parameter: 1000 characters. If the 'name' parameter is modified, the allowed characters for 'name' are: letters from any language, numbers in any script, space, dot, underscore, and hyphen. Strings with characters other than these are rejected (only for the name parameter).

PUT /v2/policy_categories/{category_id}
Request

Custom Headers

  • IBM IAM access token

Path Parameters

  • category ID

category json

Response

Response for the /v2/policy_categories/{categoryId} API

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Not Found

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

delete a policy category

Deletes a policy category given the identifier for the category. This will also delete any child categories of that category.

DELETE /v2/policy_categories/{category_id}
Request

Custom Headers

  • IBM IAM access token

Path Parameters

  • category ID

Response

Status Code

  • No Content (OK)

  • Bad Request

  • Unauthorized

  • Forbidden

  • Not Found

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

retrieve policy enforcement metrics

This API retrieves policy enforcement metrics based on query parameters, which include metric type, date range, policy, user, rule, and outcome. It then sums these metrics according to a specified aggregation type.

GET /v2/policy_metrics
Request

Custom Headers

  • IBM IAM access token

Query Parameters

  • The type of metrics to return

    Allowable values: [enactments,violations,operational_policies]

  • The type of aggregation to perform on the selected metrics (days, months, years, policies, users, outcomes)

    Allowable values: [days,months,years,policies,users,outcomes]

  • ISO 8601 date/time specifying the starting time to return metrics data

  • ISO 8601 date/time specifying the ending time to return metrics data

  • The policy to return metrics of, or all policies if not specified

  • The user to return metrics about, or all users if not specified

  • The enforcement outcome to return metrics about, or all outcomes if not specified

  • The order to sort the metrics. The following values are allowed:

    • aggregate, -aggregate -- ascending or descending order by the aggregate
    • count, -count -- ascending or descending order by enactment count

    Allowable values: [aggregate,-aggregate,count,-count]
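A metrics query combines a metric type, an aggregation type, an ISO 8601 date range, and a sort order. The sketch below assembles such a query string; the parameter names (metric, aggregation, start, end, order) are assumptions, since the reference describes the parameters without naming them.

```python
import urllib.parse
from datetime import datetime, timezone

def build_metrics_query(metric, aggregation, start, end, order="-count"):
    """Assemble the query string for GET /v2/policy_metrics.

    Dates are sent as ISO 8601 strings, as the reference requires; the
    query-parameter names here are assumptions.
    """
    if metric not in ("enactments", "violations", "operational_policies"):
        raise ValueError("unknown metric type: %s" % metric)
    if aggregation not in ("days", "months", "years",
                           "policies", "users", "outcomes"):
        raise ValueError("unknown aggregation type: %s" % aggregation)
    params = {
        "metric": metric,
        "aggregation": aggregation,
        "start": start.isoformat(),   # ISO 8601 starting time
        "end": end.isoformat(),       # ISO 8601 ending time
        "order": order,
    }
    return urllib.parse.urlencode(params)

# Example: violations per month over Q1 2019, busiest aggregate first.
q = build_metrics_query(
    "violations", "months",
    datetime(2019, 1, 1, tzinfo=timezone.utc),
    datetime(2019, 3, 31, tzinfo=timezone.utc),
    order="-count",
)
```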

Response

Response for the /v2/policy_metrics API

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Not Found

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

list all rules

Lists all defined rules. This includes all rules associated with policies and rules not associated with any policy. When more than one filter criterion is specified, the resulting collection satisfies all the criteria.

GET /v2/policy_rules
Request

Custom Headers

  • IBM IAM access token

Query Parameters

  • Specify name of the rule to search for or use filter of the form 'contains:xx' to search for rules containing provided phrase as part of name or use filter of the form 'exact:xx' to search for rules with exact name.

  • Specify description of the rule to search for or use filter of the form 'contains:xx' to search for rules containing provided phrase as part of description.

  • If specified, only rules with a matching trigger expression will be returned.

  • If specified, only rules with a matching action will be returned.

  • The order to sort the rules. The following values are allowed:

    • name, -name -- ascending or descending order by the name
    • modified_date, -modified_date -- ascending or descending order by modified date

    Allowable values: [name,-name,modified_date,-modified_date]

  • The maximum number of Rules to return. The default value is 50.

  • The index of the first matching Rule to include in the result. The default value is 0.

Response

Response for the /v2/policy_rules API

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

create a rule

Creates a rule. A rule has a trigger expression defining when the rule should be enforced, as well as a definition of what operation and outcome to enforce.

Trigger expressions are represented using nested arrays. The following describes the syntax of those arrays:

 Expression: 
 [ -conditions- ] 
 
 Conditions: 
 -predicate- 
 "NOT", -predicate- 
 -predicate-, "AND"|"OR", -conditions- 
 "NOT", -predicate-, "AND"|"OR", -conditions- 
 
 Predicate: 
 [ "$-term-", "EXISTS" ] 
 [ "$-term-", "EQUALS"|"LESS_THAN"|"GREATER_THAN"|"CONTAINS", "#-literal-"|"$-term-" ] 
 -expression- 

where:

  • -term- is a technical term defined in the term glossary.
  • -literal- is a literal. For numerics a string representation of the number should be specified. For times, milliseconds are used (from Unix epoch). For boolean, #true and #false are used.

The definition of the operators in a predicate:

  • EXISTS -- means that the term has a value of some kind.
  • EQUALS -- evaluates to true if the left and right sides are equal.
  • LESS_THAN -- evaluates to true if the left is less in numeric value than the right.
  • GREATER_THAN -- evaluates to true if the left is greater in numeric value than the right.
  • CONTAINS -- is meant to test an array term (such as Asset.Tags) against a single value. It evaluates to true if the value on the right side equals one of the values in the array on the left side.
    However, it also supports a single value on the left, in which case it behaves just like EQUALS. Regular expressions and wildcards are not supported.

For all of the operators (except EXISTS), if the right hand side evaluates to an array, each value of the array is compared to the left side, according to the operator definition, and if any comparison is true then the result of the evaluation is true.

Examples:

 [["$Asset.Type", "EQUALS", "#Project"]] 
 ["NOT", ["$Asset.Tags", "CONTAINS", "#sensitive"], "AND", ["NOT", "$Asset.Tags", "CONTAINS", "#confidential"]] 
 [["$User.Name", "EQUALS", "$Asset.Owner"]] 

Maximum length allowed for the 'name' parameter: 80 characters; maximum length allowed for the 'description' parameter: 1000 characters. Allowed characters for the 'name' parameter: letters from any language, numbers in any script, space, dot, underscore, and hyphen. Strings with characters other than these are rejected (only for the name parameter).

POST /v2/policy_rules
Request

Custom Headers

  • IBM IAM access token

Rule json

Response

Response for the /v2/policy_rules/{ruleId} API

Status Code

  • successful operation

  • Created

  • Bad Request

  • Unauthorized

  • Forbidden

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

retrieve a rule

Retrieves detailed information on a rule given the rule's identifier.

GET /v2/policy_rules/{rule_id}
Request

Custom Headers

  • IBM IAM access token

Path Parameters

  • Rule ID

Response

Response for the /v2/policy_rules/{ruleId} API

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Not Found

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

update a rule

Updates a rule. Maximum length allowed for the 'name' parameter: 80 characters; maximum length allowed for the 'description' parameter: 1000 characters. If the 'name' parameter is modified, the allowed characters for 'name' are: letters from any language, numbers in any script, space, dot, underscore, and hyphen. Strings with characters other than these are rejected (only for the name parameter). The governance_type_id cannot be modified.

PUT /v2/policy_rules/{rule_id}
Request

Custom Headers

  • IBM IAM access token

Path Parameters

  • Rule ID

Rule json

Response

Response for the /v2/policy_rules/{ruleId} API

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Not Found

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

delete a rule

Deletes a rule. A rule can only be deleted if it is not currently associated with any policy (in any state).

DELETE /v2/policy_rules/{rule_id}
Request

Custom Headers

  • IBM IAM access token

Path Parameters

  • Rule ID

Response

Status Code

  • No Content (OK)

  • Bad Request

  • Unauthorized

  • Forbidden

  • Not Found

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

retrieve terms used in a rule

Retrieves the names of all terms used in a rule.

GET /v2/policy_rules/{rule_id}/terms
Request

Custom Headers

  • IBM IAM access token

Path Parameters

  • Rule ID

Response

Response for the /v2/policy_rules/{ruleId}/terms API

Status Code

  • OK

  • Bad Request

  • Unauthorized

  • Forbidden

  • Not Found

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

create a new Streams Flow

Creates a Streams Flow in the project with the specified fields in the body of the request.

The Streams Flow will exist within the project specified by project_id, provided the caller is an authenticated user and has the admin/editor role in the project.

For a Streams Flow to be deployable, it requires a pipeline graph, as well as the instance_id of the Streaming Analytics instance.

To create a Streams Flow using the pipeline object of an existing Streams Flow, copy the entity.pipeline from the response of GET /v2/streams_flows/{id}.
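The copy-the-pipeline step can be sketched as a small payload builder. The body field names used here ("name", "pipeline") are assumptions; consult the streams flow body schema for the exact shape.

```python
def clone_flow_payload(existing_flow, new_name):
    """Build a POST /v2/streams_flows body that reuses the pipeline of an
    existing flow, as returned by GET /v2/streams_flows/{id}.

    The field names "name" and "pipeline" are assumptions about the
    request body schema.
    """
    return {
        "name": new_name,
        # entity.pipeline is the pipeline graph of the source flow.
        "pipeline": existing_flow["entity"]["pipeline"],
    }

# Usage sketch: fetch the source flow with GET /v2/streams_flows/{id},
# then POST clone_flow_payload(source, "my copy") to /v2/streams_flows.
```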

POST /v2/streams_flows
Request

Custom Headers

  • Identity Access Management (IAM) bearer token.

Query Parameters

  • The ID of the project to use.

body of streams flow

Response

Status Code

  • Created

  • Invalid request

  • Unauthorized

  • Forbidden

  • Error

  • Not Implemented

No Sample Response

This method does not specify any sample responses.

return a list of Streams Flows

Return a list of Streams Flows within the project of the given project_id, provided that this authenticated user has read access.

GET /v2/streams_flows
Request

Custom Headers

  • Identity Access Management (IAM) bearer token.

Query Parameters

  • The ID of the project to use.

Response

Status Code

  • Success

  • Unauthorized

  • Forbidden

  • Error

No Sample Response

This method does not specify any sample responses.

return a Streams Flow

Return a Streams Flow with this given id, provided the authenticated user has read access to the project. To create another flow with this same pipeline, copy the pipeline object from the entity of the response. Use this pipeline object in the body of the POST.

GET /v2/streams_flows/{id}
Request

Custom Headers

  • Identity Access Management (IAM) bearer token.

Path Parameters

  • ID of Flow to get.

Query Parameters

  • The ID of the project to use.

Response

Status Code

  • Success

  • Not Found

  • Unauthorized

  • Forbidden

  • Error

No Sample Response

This method does not specify any sample responses.

delete a Streams Flow

Delete a Streams Flow, provided the authenticated user has admin/editor role in the project this Streams Flow is contained in.

If the Streams Flow is currently deployed, the job will be stopped before the Streams Flow is deleted.

DELETE /v2/streams_flows/{id}
Request

Custom Headers

  • Identity Access Management (IAM) bearer token.

Path Parameters

  • ID of Flow to delete.

Query Parameters

  • The ID of the project to use.

Response

Status Code

  • No content

  • Not Found

  • Unauthorized

  • Forbidden

  • Error

No Sample Response

This method does not specify any sample responses.

modify an existing Streams Flow

Modifies an existing Streams Flow.

Can update the name, description, pipeline, and instance_id of the Streaming Analytics instance.

PATCH /v2/streams_flows/{id}
Request

Custom Headers

  • Identity Access Management (IAM) bearer token.

Path Parameters

  • ID of Flow to update.

Query Parameters

  • The ID of the project to use.

body of streams flow

Response

Status Code

No Sample Response

This method does not specify any sample responses.

start or stop a Streams Flow

This POST request can be used to start or stop the deployment of a Streams Flow, provided the authenticated user has the admin/editor role in the project that contains the Streams Flow. The request returns success as soon as the API receives it; however, the start or stop of the deployment continues to run asynchronously until completion. It is a long-running operation.

The status of the deployment can be tracked with a GET /v2/streams_flows/{id}/runs/{r_id} request. The request only goes through if the Streams Flow is in the correct state.

  • To start deployment, send a body containing { "state": "started" }. This requires the Streams Flow to be in the stopped state.

  • To stop deployment, send a body containing { "state": "stopped" }. This requires the Streams Flow to be in the running state.
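Because the request returns before the deployment finishes, clients typically poll the run status until it reaches the desired state. A minimal polling sketch follows; fetch_status stands in for a wrapper around GET /v2/streams_flows/{id}/runs/{r_id}, and the function name is illustrative.

```python
import time

def wait_for_state(fetch_status, target, timeout=300.0, interval=5.0,
                   sleep=time.sleep):
    """Poll until the run reaches the target state, e.g. "running" after a
    start request or "stopped" after a stop request.

    fetch_status is any callable returning the current run state -- for
    example a wrapper around GET /v2/streams_flows/{id}/runs/{r_id}.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_status()
        if state == target:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("run did not reach %r in time" % target)
        sleep(interval)

# Starting a deployment would first POST {"state": "started"} to
# /v2/streams_flows/{id}/runs, then:
#   wait_for_state(get_run_state, "running")
```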

POST /v2/streams_flows/{id}/runs
Request

Custom Headers

  • Identity Access Management (IAM) bearer token.

  • UAA token.

Path Parameters

  • ID of Streams Flow to get.

Query Parameters

  • The ID of the project to use.

FlowRun creation body parameters

Response

Status Code

  • Created

  • Invalid Request

  • Unauthorized

  • Forbidden

  • Unknown Error

No Sample Response

This method does not specify any sample responses.

return a list of a Streams Flow's Runs

Get the runtime status of a given Streams Flow. A Streams Flow that has status of "running" implies the Streams Flow has been successfully deployed and is currently running. When the Streams Flow is in the "stopped" state, a PATCH request can be issued to deploy the Streams Flow.

GET /v2/streams_flows/{id}/runs
Request

Custom Headers

  • Identity Access Management (IAM) bearer token.

Path Parameters

  • ID of Streams Flow to get.

Query Parameters

  • The ID of the project to use.

Response

Status Code

  • Success

  • Not Found

  • Unauthorized

  • Forbidden

  • Error

No Sample Response

This method does not specify any sample responses.

return a Streams Flow's Run

Returns a single run of a Streams Flow, given the Streams Flow ID and the run ID.

GET /v2/streams_flows/{id}/runs/{r_id}
Request

Custom Headers

  • Identity Access Management (IAM) bearer token.

Path Parameters

  • ID of Flow to get.

  • ID of FlowRun to get.

Query Parameters

  • The ID of the project to use.

Response

Status Code

  • Success

  • Not Found

  • Unauthorized

  • Forbidden

  • Error

No Sample Response

This method does not specify any sample responses.

start or stop a Streams Flow

This PATCH request can be used to start or stop the deployment of a Streams Flow, provided the authenticated user has the admin/editor role in the project that contains the Streams Flow. The request returns success as soon as the API receives it; however, the start or stop of the deployment continues to run asynchronously until completion. It is a long-running operation.

The status of the deployment can be tracked with a GET /v2/streams_flows/{id}/runs/{r_id} request. The request only goes through if the Streams Flow is in the correct state.

  • To start deployment, send a body containing { "state": "started" }. This requires the Streams Flow to be in the stopped state.

  • To stop deployment, send a body containing { "state": "stopped" }. This requires the Streams Flow to be in the running state.

PATCH /v2/streams_flows/{id}/runs/{r_id}
Request

Custom Headers

  • Identity Access Management (IAM) bearer token.

  • UAA token.

Path Parameters

  • ID of Streams Flow to get.

  • ID of FlowRun to update.

Query Parameters

  • The ID of the project to use.

FlowRun creation body parameters

Response

Status Code

  • Success

  • Invalid request

  • Unauthorized

  • Forbidden

  • Error

No Sample Response

This method does not specify any sample responses.

Creates an import request that is eligible for execution.

Steps to import glossary terms from a CSV or XMI file:
(1) Do a POST to /v2/glossary_imports and retrieve the <guid> from the json response.
(2) With the <guid> obtained in step 1, do a POST to /v2/glossary_imports/<guid> to upload the file and start the import. This request returns immediately after starting the import.
(3) A GET to /v2/glossary_imports/<guid> can be used to query the status of an import. It will return 'done' when the import completes.
(4) After an import is complete, do a GET to /v2/glossary_imports/<guid>/report to retrieve the import results report.
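The four steps above can be sketched as a small driver. Here `client` is any object with post(path, ...) and get(path) methods returning parsed JSON; the "type" query-parameter name and the "guid" and "status" response fields are assumptions about the JSON shapes, since the reference does not name them.

```python
import time

def run_glossary_import(client, file_bytes, fmt="CSV", poll_interval=2.0):
    """Drive the four-step glossary import flow.

    The query-parameter name "type" and the response fields "guid" and
    "status" are assumptions; adjust them to the actual schemas.
    """
    # (1) Create the import request and pick up its guid.
    created = client.post("/v2/glossary_imports?type=" + fmt)
    guid = created["guid"]
    # (2) Upload the file, which starts the import and returns immediately.
    client.post("/v2/glossary_imports/" + guid, data=file_bytes)
    # (3) Poll until the import reports 'done'.
    while client.get("/v2/glossary_imports/" + guid)["status"] != "done":
        time.sleep(poll_interval)
    # (4) Fetch the import results report.
    return client.get("/v2/glossary_imports/" + guid + "/report")
```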

POST /v2/glossary_imports
Request

Custom Headers

  • JWT Bearer token

  • Runs the operation as a different tenant. Requires the FunctionalUser role. Format: accountId[:userId]

Query Parameters

  • The format of the file being imported. Supported formats are "CSV" and "XMI"

    Allowable values: [CSV,XMI]

  • The name of the file being imported. This has no impact on the import processing. It is just used for reporting.

  • Initial term state. This parameter is required for CSV files without a State column. It is optional for all other imports. If present, the provided term state overrides the term states from the file. Valid values are "DRAFT", "ACTIVE", and "ARCHIVED".

    Allowable values: [DRAFT,ACTIVE,ARCHIVED]

Response

Status Code

  • Accepted

  • Bad Request

  • Unauthorized

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Gets the status of an import.

This method can be used to show the progress of the asynchronous import triggered by the POST /v2/glossary_imports/{guid} method.

GET /v2/glossary_imports/{guid}
Request

Custom Headers

  • JWT Bearer token

  • Runs the operation as a different tenant. Requires the FunctionalUser role. Format: accountId[:userId]

Path Parameters

  • The guid of the Import

Response

Status Code

  • Success

  • Bad Request

  • Unauthorized

  • The import was not found.

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Uploads the file to start an import.

This request will return after the import file upload has completed. The import will continue running in the background. To obtain progress information about the running import, call GET /v2/glossary_imports/{guid}.

Steps to import glossary terms from a CSV or XMI file:
(1) Do a POST to /v2/glossary_imports and retrieve the <guid> from the json response.
(2) With the <guid> obtained in step 1, do a POST to /v2/glossary_imports/<guid> to upload the file and start the import. This request returns immediately after starting the import.
(3) A GET to /v2/glossary_imports/<guid> can be used to query the status of an import. It will return 'done' when the import completes.
(4) After an import is complete, do a GET to /v2/glossary_imports/<guid>/report to retrieve the import results report.

There are two types of files that can be imported:

  • Information Server Information Governance Catalog XMI Files
  • CSV Files

XMI Files

We support importing XMI files exported from Information Server Information Governance Catalog 11.3 and higher. Imports from both the development glossary and the catalog are supported.

CSV Files

The CSV file must contain a header row with the names of the columns. The accepted column names are Name, Display Name, Description, Business Definition, Owner, Tags, and State. These column names are case sensitive. The presence of any other column is treated as a bad request. Only the Name field is required. If there is no Display Name column, the Display Name is set to the Name. The Tags field contains a comma-delimited list of the term's tags. Any tag that contains a comma must have that comma escaped by doubling it.


The CSV file must comply with https://tools.ietf.org/html/rfc4180.



Example
Name,Display Name,Business Definition,Description,Owner,Tags,State
Term 1,Display Name for Term Number One,Business Definition for Term One, This is a description of the first term to import.,me@us.ibm.com,"first tag, second tag",ACTIVE
Term 2,Display Name for Term Number Two,Business Definition for Term Two, This is a description of the second term to import.,them@us.ibm.com,"first tag, second tag",DRAFT
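Rows like the example above can be produced with Python's csv module, which emits RFC 4180-compliant quoting; only the comma-doubling rule for the Tags field needs special handling. A sketch:

```python
import csv
import io

COLUMNS = ["Name", "Display Name", "Business Definition", "Description",
           "Owner", "Tags", "State"]

def tags_field(tags):
    """Join tags with commas, doubling any comma inside a tag, as the
    import format requires."""
    return ",".join(t.replace(",", ",,") for t in tags)

def terms_to_csv(rows):
    """rows: list of dicts keyed by the accepted column names, with tags
    given as a list under "Tags". Missing columns are left empty."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\r\n")
    writer.writeheader()
    for row in rows:
        row = dict(row, Tags=tags_field(row.get("Tags", [])))
        writer.writerow(row)
    return buf.getvalue()
```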

POST /v2/glossary_imports/{guid}
Request

Custom Headers

  • JWT Bearer token

  • Specifies the length of the import file, in bytes

  • Runs the operation as a different tenant. Requires the FunctionalUser role. Format: accountId[:userId]

Path Parameters

  • The guid of the Import to start

The input stream for reading imported terms

Response

Status Code

  • The file to import was successfully loaded.

  • Bad Request

  • Unauthorized

  • The request failed because an import with the specified {guid} does not exist in the system.

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Requests that the specified import be cancelled.

If the import has already completed by the time the cancellation request is processed, this method returns without cancelling the import. Also, note that any terms imported before the cancellation is processed are not removed; the cancellation only prevents additional terms from being imported.

DELETE /v2/glossary_imports/{guid}
Request

Custom Headers

  • JWT Bearer token

  • Runs the operation as a different tenant. Requires the FunctionalUser role. Format: accountId[:userId]

Path Parameters

  • The guid of the Import

Response

Status Code

  • The cancel request was processed

  • Bad Request

  • Unauthorized

  • The import was not found.

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Creates a report showing the import results.

The report cannot be obtained while the import is in WAITING or IN_PROGRESS state.

GET /v2/glossary_imports/{guid}/report
Request

Custom Headers

  • JWT Bearer token

Path Parameters

  • The guid of the Import

Response

Status Code

  • Success

  • Bad Request

  • Unauthorized

  • The import was not found.

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

List the status of imported terms for the given import ID.

The list is incomplete if the import is in progress or waiting. You can check the status of the import by calling GET /v2/glossary_imports/{guid}.

GET /v2/glossary_imports/{guid}/results
Request

Custom Headers

  • JWT Bearer token

Path Parameters

  • The guid of the Import

Query Parameters

  • Filter by the status of imported term.
    Valid values are SUCCESS, FAILED

    Allowable values: [SUCCESS,FAILED]

  • The maximum number of Terms to return.
    The default value is 50.

  • Bookmark that gives the start of the page.

  • Sorting order.
    Valid values are row_number, term_display_name and failure_reason
    Prefix hyphen (-) for descending order.

Response

Status Code

  • Success

  • Bad Request

  • Unauthorized

  • The import was not found.

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Deprecated. Use <code>GET /v2/glossary_imports/{guid}/results</code> method.

Deprecated

GET /v2/glossary_imports/{guid}/search
Request

Custom Headers

  • JWT Bearer token

Path Parameters

  • The guid of the Import

Query Parameters

  • Filter by the status of the imported term. Valid values are "SUCCESS" and "FAILED".

    Allowable values: [SUCCESS,FAILED]

  • The maximum number of Terms to return; The default value is 50.

  • Bookmark that gives the start of the page

  • Sorting order; Valid values are term_name, failure_reason; Prefix hyphen (-) for descending order

  • Deprecated. The maximum number of Terms to return; The default value is 50.

  • Deprecated. Bookmark that gives the start of the page

  • Deprecated. Sorting order; Valid values are term_name, failure_reason; Prefix hyphen (-) for descending order

Response

Status Code

  • Success

  • Bad Request

  • Unauthorized

  • The import was not found.

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Adds terms to the glossary.

If the unique constraint on the name or display name of the term is violated, the method fails with 409 Conflict response.

Sends a RabbitMQ message with the IDs of the created terms and CREATE_TERM event.

Administrator role is required.

POST /v2/glossary_terms
Request

Custom Headers

  • JWT Bearer token

  • Runs the operation as a different tenant. Requires the FunctionalUser role. Format: accountId[:userId]

Terms to be created. The terms array must contain at least 1 term, and cannot exceed 100 terms.
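Since each request accepts at most 100 terms, a larger term list has to be split into batches before posting. A minimal sketch of that split:

```python
def term_batches(terms, batch_size=100):
    """Split a term list into batches acceptable to POST /v2/glossary_terms,
    which requires at least 1 and at most 100 terms per request."""
    if not terms:
        raise ValueError("at least one term is required")
    if not 1 <= batch_size <= 100:
        raise ValueError("batch_size must be between 1 and 100")
    return [terms[i:i + batch_size]
            for i in range(0, len(terms), batch_size)]

# Each batch would then be sent as the terms array of one POST request.
```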

Response

Status Code

  • The terms were created successfully.

  • Bad Request.

  • Unauthorized.

  • Unique constraint on the name or display name of the term was violated.

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Runs Lucene search query for the archived terms in the glossary.

Lucene query using the following keys: term.display_name, term.display_name_keyword, term.name, term.description, term.type, term.sub_type, term.updated_at, term.associated_business_terms, term.associated_technical_term, and term.tags
   term.display_name: Matches the complete display name; the value should always be in lowercase
   term.display_name_keyword: Matches words within the display name
   term.associated_display_name: Matches the complete display name of associated terms; the value should always be in lowercase
   term.name: Matches the complete name; the value should always be in lowercase
   term.type: Valid values are Business Terms, Technical Terms
   term.sub_type: Valid values are Asset Level Classification, DPS Context Attribute, DPS Data Attribute, DPS Enumeration, DPS Conceptual, IBM Classifier, IBM Group Classifier, Custom Classifier

Example queries:
   Search for archived terms that have display name starting with 'C', 'D', 'E', etc.: term.display_name:[c* TO z*];
   Search for archived terms that have "Customer" in the display name and has "PII" or "SSN" tags: (term.display_name_keyword:"Customer" AND (term.tags:PII OR term.tags:SSN))

If the result set of the query is larger than the limit parameter, it returns the first limit number of archived terms.
To retrieve the next set of archived terms, call the search method again by using the URI in TokenPaginatedTermList.next returned by this method.
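The pagination loop described above -- follow TokenPaginatedTermList.next until no further page is returned -- can be sketched as follows. Here post_search stands in for the POST to this endpoint, and the "terms" key holding each page's results is an assumption about the response shape.

```python
def search_all_archived(post_search, query, limit=200):
    """Collect every archived term matching a Lucene query by following
    TokenPaginatedTermList.next until it is absent.

    post_search(uri, query) stands in for POST /v2/glossary_terms/archive/search;
    the "terms" results key is an assumption about the response JSON.
    """
    uri = "/v2/glossary_terms/archive/search?limit=%d" % limit
    results = []
    while uri:
        page = post_search(uri, query)
        results.extend(page.get("terms", []))
        uri = page.get("next")   # absent on the last page
    return results

# Example query from the reference:
#   (term.display_name_keyword:"Customer" AND (term.tags:PII OR term.tags:SSN))
```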

Administrator role is required.

POST /v2/glossary_terms/archive/search
Request

Custom Headers

  • JWT Bearer token

  • Runs the operation as a different tenant. Requires the FunctionalUser role. Format: accountId[:userId]

Query Parameters

  • The maximum number of Terms to return - must be at least 1 and cannot exceed 200. The default value is 50.

  • Bookmark that gives the start of the page.

  • Specifies the sort order of the results in field_name format.
    Valid values are:
    term.display_name<string> for sorting terms by name in ascending order.
    term.updated_at<string> for sorting terms by updated time in ascending order.
    Prefix hyphen (-) for descending order.
    If sort parameter is not specified, then the results are searched by updated time in ascending order.

Search query

Response

Status Code

  • Successful Operation.

  • Bad Request

  • Unauthorized

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Gets the archived term in the glossary with a given guid.

Administrator role is required.

GET /v2/glossary_terms/archive/{guid}
Request

Custom Headers

  • JWT Bearer token

Path Parameters

  • The guid of the archived term to fetch.

Response

Status Code

  • The archived term was successfully retrieved.

  • Unauthorized

  • The archived term with specified {guid} does not exist in the glossary.

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Purge the archived glossary term with a given guid.

This operation results in a permanent hard delete of the archived term. A purged term cannot be restored.

Sends a RabbitMQ message with the term ID and PURGE_TERM event.

Administrator role is required.

DELETE /v2/glossary_terms/archive/{guid}
Request

Custom Headers

  • JWT Bearer token

  • Runs the operation as a different tenant. Requires the FunctionalUser role. Format: accountId[:userId]

Path Parameters

  • The guid of the term to purge.

Response

Status Code

  • The archived term was successfully purged.

  • Bad Request

  • Unauthorized

  • An archived term with the specified {guid} does not exist in the glossary.

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Restores the archived glossary term with a given guid.

The state of the restored term is DRAFT.

Sends a RabbitMQ message with the restored term ID and RESTORE_TERM event.

Administrator role is required.

POST /v2/glossary_terms/archive/{guid}/restore
Request

Custom Headers

  • JWT Bearer token

  • Runs the operation as a different tenant. Requires the FunctionalUser role. Format: accountId[:userId]

Path Parameters

  • The guid of the term to restore.

Response

Status Code

  • The term was successfully restored.

  • Bad Request

  • Unauthorized

  • An archived term with the specified {guid} does not exist in the glossary.

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Business Glossary Heartbeat.

Returns the build details and the health of the dependent services.

GET /v2/glossary_terms/heartbeat
Request

No Request Parameters

This method does not accept any request parameters.

Response

Status Code

  • Successful Operation.

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Runs a Lucene search query for the non-archived terms in the glossary.

The Lucene query can use the following keys: term.display_name, term.display_name_keyword, term.name, term.description, term.type, term.sub_type, term.updated_at, term.associated_business_terms, term.associated_technical_term, term.state, and term.tags
   term.display_name: Matches the complete display name; the value must be all lowercase
   term.display_name_keyword: Matches individual words within the display name
   term.associated_display_name: Matches the complete display name of associated terms; the value must be all lowercase
   term.name: Matches the complete name; the value must be all lowercase
   term.type: Valid values are Business Terms, Technical Terms
   term.sub_type: Valid values are Asset Level Classification, DPS Context Attribute, DPS Data Attribute, DPS Enumeration, DPS Conceptual, IBM Classifier, IBM Group Classifier, Custom Classifier
   term.state: Valid values are DRAFT, ACTIVE
   term.ancestors: Name of an ancestor of a technical term. It cannot be used for business terms.

Example queries:
   Search for terms that have display name starting with 'C', 'D', 'E', etc.: term.display_name:[c* TO z*];
   Search for terms that have "Customer" in the display name and has "PII" or "SSN" tags: (term.display_name_keyword:"Customer" AND (term.tags:PII OR term.tags:SSN))

If the result set of the query is larger than the limit parameter, the first limit terms are returned.
To retrieve the next set of terms, call the search method again by using the URI in TokenPaginatedTermList.next returned by this method.

POST /v2/glossary_terms/search
Request

Custom Headers

  • JWT Bearer token

  • Runs the operation as a different tenant. Requires the FunctionalUser role. Format: accountId[:userId]

Query Parameters

  • Additional filter to retrieve business terms associated with the given asset_id.

  • The maximum number of Terms to return - must be at least 1 and cannot exceed 200. The default value is 50.

  • Bookmark that gives the start of the page.

  • Deprecated.

  • Deprecated.

  • Specifies the sort order of the results in field_name format.
    Valid values are:
    term.display_name<string> to sort terms by name in ascending order.
    term.updated_at<string> to sort terms by updated time in ascending order.
    term.created_at<string> to sort terms by creation time in ascending order.
    Prefix a hyphen (-) for descending order.
    If the sort parameter is not specified, the results are sorted by creation time in ascending order.

Search query and other search options.

Response

Status Code

  • Successful Operation.

  • Bad Request

  • Unauthorized

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.
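Because no sample request is provided, the following Python sketch shows how a search call could be assembled. It only builds the URL, headers, and body without sending anything; the base URL is the API endpoint above, the IAM token is a placeholder, and the body property name "query" is an assumption, so verify it against the request schema.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.dataplatform.cloud.ibm.com"

def build_search_request(query, limit=50, token="<IAM token>"):
    """Assemble URL, headers, and body for POST /v2/glossary_terms/search.

    The body property name "query" is an assumption; check the request
    schema for the exact property names and the other search options.
    """
    url = f"{BASE_URL}/v2/glossary_terms/search?{urlencode({'limit': limit})}"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"query": query})
    return url, headers, body

# Example query from above: "Customer" terms tagged PII or SSN
url, headers, body = build_search_request(
    '(term.display_name_keyword:"Customer" AND (term.tags:PII OR term.tags:SSN))',
    limit=100,
)
```

If the result set exceeds the limit, the next page is fetched by repeating the call with the URI in TokenPaginatedTermList.next.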

Lists all of the unique tags that have been applied to glossary terms.

If the result set of the query is larger than the limit parameter, the first limit tags are returned.
To retrieve the next set of tags, call the method again by using the URI in PaginatedTagsList.next returned by this method.

GET /v2/glossary_terms/tags
Request

Custom Headers

  • JWT Bearer token

  • Runs the operation as a different tenant. Requires the FunctionalUser role. Format: accountId[:userId]

Query Parameters

  • The maximum number of tags to return - must be at least 1 and cannot exceed 200. The default value is 50.

  • The index of the first matching tag to include in the result.

  • Deprecated.

  • Deprecated.

Response

Status Code

  • Success.

  • Unauthorized

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Gets the term in the glossary with a given guid.

This method can be used for retrieving details of an ACTIVE or DRAFT term.

GET /v2/glossary_terms/{guid}
Request

Custom Headers

  • JWT Bearer token

  • Runs the operation as a different tenant. Requires the FunctionalUser role. Format: accountId[:userId]

Path Parameters

  • The guid of the term to fetch.

Response

Status Code

  • The term was successfully retrieved.

  • Unauthorized

  • The term with specified {guid} does not exist in the glossary.

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Archives or purges the glossary term with a given guid, depending on term state.

If the term state is DRAFT, the term is purged (hard deleted).
If the term state is ACTIVE, the term is archived.

Archived terms can be searched by calling POST /v2/glossary_terms/archive/search.
A specific archived term can be retrieved by calling GET /v2/glossary_terms/archive/{guid}.

Sends a RabbitMQ message with the term ID, associated technical term ID (if any), and ARCHIVE_TERM or PURGE_TERM event.

Administrator role is required.

DELETE /v2/glossary_terms/{guid}
Request

Custom Headers

  • JWT Bearer token

  • Runs the operation as a different tenant. Requires the FunctionalUser role. Format: accountId[:userId]

Path Parameters

  • The guid of the term to archive.

Response

Status Code

  • The term was successfully archived/purged.

  • Bad Request

  • Unauthorized

  • The term with specified {guid} does not exist in the glossary.

  • The term is currently used in catalog assets, catalog asset columns, data governance rules or data governance policies.

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Updates a term in the glossary.

Sends a RabbitMQ message with the term ID and UPDATE_TERM event. Administrator role is required.

PATCH /v2/glossary_terms/{guid}
Request

Custom Headers

  • JWT Bearer token

  • Runs the operation as a different tenant. Requires the FunctionalUser role. Format: accountId[:userId]

Path Parameters

  • The guid of the term to be updated.

The updated term. Omitted fields are left unchanged; fields explicitly set to null are cleared.

Additional Example:

{
    "description": "Updated description."
}

Response

Status Code

  • The term was successfully updated.

  • Bad Request

  • Unauthorized

  • Forbidden

  • The update failed because the term with specified {guid} does not exist in the glossary.

  • The term is currently used in catalog assets, catalog asset columns, data governance rules or data governance policies.

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.
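To illustrate the merge semantics described above, here is a minimal Python sketch that prepares (but does not send) the PATCH request. The term guid and IAM token are placeholders.

```python
import json
import urllib.request

BASE_URL = "https://api.dataplatform.cloud.ibm.com"

def build_update_term_request(guid, changes, token="<IAM token>"):
    """Prepare PATCH /v2/glossary_terms/{guid}.

    `changes` is a partial term: omitted fields stay unchanged and
    fields explicitly set to None (JSON null) are cleared.
    """
    return urllib.request.Request(
        f"{BASE_URL}/v2/glossary_terms/{guid}",
        data=json.dumps(changes).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )

# Placeholder guid; body matches the example above
req = build_update_term_request(
    "example-term-guid", {"description": "Updated description."}
)
```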

List the associations of the given type for the specified term.

If the result set is larger than the limit parameter, the first limit associations are returned.
To retrieve the next set of associations, call the method again by using the URI in PaginatedTagsList.next returned by this method.

The associations of a child term, such as SSN, include the associations of its parent terms, such as Government Identities.
The associations of a business term include the associations of the mapped technical term.
The associations of a technical term include the associations of the mapped business term.

GET /v2/glossary_terms/{guid}/associations
Request

Custom Headers

  • JWT Bearer token

Path Parameters

  • The guid of the Term

Query Parameters

  • Comma separated list of association types. Allowed values of association types are ASSET, DPS_RULE, DPS_POLICY, COLUMN, ALL

  • Name of the associated entity - policy name, rule name, asset name, or column name

  • Filter the results by the container of the associated entity.
    Associations of ASSET type can be filtered by passing the list of catalog_ids. This parameter is not allowed for other association types.
    Multiple values in the list can be passed as comma separated values e.g. catalog_ids=XXXXXXX,XXXXXXX

  • The maximum number of associations to return - must be at least 1 and cannot exceed 200. The default value is 50.

  • Bookmark that gives the start of the page.

  • Sorting order.
    Valid values are asset_name<string>, rule_name<string>, rule_id<string>, policy_name<string>, policy_id<string>, created_at<string>, column_name<string>.
    Prefix a hyphen (-) for descending order.

Response

Status Code

  • Success

  • Bad Request

  • Unauthorized

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.
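The filters above can be combined into a single query string. The following Python sketch shows one way to encode them; the ids are placeholders, `catalog_ids` and `limit` follow the parameter descriptions above, and the parameter name `type` for the association types is an assumption to verify against the full reference.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.dataplatform.cloud.ibm.com"

def build_associations_url(term_guid, association_types, catalog_ids=None, limit=50):
    """Build the URL for GET /v2/glossary_terms/{guid}/associations.

    association_types and catalog_ids are passed as comma-separated
    lists, as the query parameters above require. The parameter name
    "type" is an assumption; check it against the full reference.
    """
    params = {"type": ",".join(association_types), "limit": limit}
    if catalog_ids:
        params["catalog_ids"] = ",".join(catalog_ids)
    # safe="," keeps the commas unescaped in the query string
    return (f"{BASE_URL}/v2/glossary_terms/{term_guid}/associations"
            f"?{urlencode(params, safe=',')}")

url = build_associations_url("example-term-guid", ["ASSET", "COLUMN"],
                             catalog_ids=["cat-id-1", "cat-id-2"])
```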

Gets the rule range for the given term.

If the result set of the permitted terms is larger than the limit parameter, the first limit permitted terms are returned.
To retrieve the next set of permitted terms, call the rule_range method again by using the URI in permitted_terms.next returned by this method.

GET /v2/glossary_terms/{guid}/rule_range
Request

Custom Headers

  • JWT Bearer token

Path Parameters

  • The guid of the term.

Query Parameters

  • The maximum number of permitted terms to return - must be at least 1 and cannot exceed 200. The default value is 50.

  • Bookmark that gives the start of the next page of permitted terms to be returned.

  • Deprecated.

  • Deprecated.

  • Deprecated.

Response

Status Code

  • Success.

  • Unauthorized

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.

Lists the view count for the given term.

The view count can be retrieved for a given time scale and given period of time.

GET /v2/glossary_terms/{guid}/stats
Request

Custom Headers

  • JWT Bearer token

Path Parameters

  • The guid of the Term

Query Parameters

  • Start time, specified as seconds since 1970-01-01T00:00:00Z (Unix epoch) or as a timestamp string in either YYYY-MM-DDTHH:mm:ssZ or YYYY-MM-DDTHH:mm:ss.sssZ format

  • Time scale for the output data. Valid values are hour, day, week, month, and year. For example, if hour is specified, the response contains the term view count for each hour from the given startTime to the given endTime.

    Allowable values: [hour,day,week,month,year]

  • End time, specified as seconds since 1970-01-01T00:00:00Z (Unix epoch) or as a timestamp string in either YYYY-MM-DDTHH:mm:ssZ or YYYY-MM-DDTHH:mm:ss.sssZ format. If not specified, defaults to the current time.

Response

Status Code

  • Success

  • Bad Request

  • Unauthorized

  • Internal Server Error

No Sample Response

This method does not specify any sample responses.
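Both time parameters accept either Unix epoch seconds or an ISO 8601 timestamp string. The stdlib-only snippet below shows how the two accepted forms relate for one calendar moment; no API call is made.

```python
from datetime import datetime, timezone

# One calendar moment, expressed in both accepted formats
moment = datetime(2024, 1, 1, tzinfo=timezone.utc)

epoch_seconds = int(moment.timestamp())             # seconds since the Unix epoch
iso_string = moment.strftime("%Y-%m-%dT%H:%M:%SZ")  # YYYY-MM-DDTHH:mm:ssZ form

# Either value could then be supplied as the start time, for example:
# GET /v2/glossary_terms/{guid}/stats?startTime=1704067200&timeScale=day
# (the exact query parameter names should be checked in the full reference)
```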

List data profiles

Status: Complete

Get the list of data profiles for a given data asset in the specified project or catalog, provided the caller has the necessary rights to do so. The returned results can be filtered by using one or more of the listed parameters.

| Field      | Match type | Example                  |
| ---------- | ---------- | ------------------------ |
| dataset_id | Equals     | ?dataset_id=5210c7d-cf6b |

GET /v2/data_profiles
Request

Custom Headers

  • Identity Access Management (IAM) bearer token.

Query Parameters

  • The ID of the data set to use.

  • The ID of the catalog to use. catalog_id or project_id is required.

  • The ID of the project to use. catalog_id or project_id is required.

  • Whether to include the entity component. If set to false, only the metadata component is populated in the response.

    Default: true

Response

A page from a collection of data profiles.

Status Code

  • Success.

  • You are not permitted to perform this action.

  • You are not authorized to list the available data profiles.

  • Not Found.

  • An error occurred. The data profiles cannot be listed.

No Sample Response

This method does not specify any sample responses.

Create data profile

Status: Complete

Creates a data profile for a given data set in the specified project or catalog, provided the caller has the necessary rights to do so. Subsequent calls to use the data profile must specify the relevant project or catalog ID that the data profile was created in.

The request payload must include the 'metadata' section containing the data set id and catalog/project id.

The request payload can include an optional 'entity' section that specifies the data profile options. If these options are not specified, default options are used.

AMQP 1.0 Messages

When a new data profile is created, a message is fired with a state of the new data profile in the body.

Topic

v2.data_profiles.:guid.POST, where ":guid" represents the profile_id of the created data profile.

Subscribe to it by using the "v2.data_profiles.*.POST" binding key.

Example Message

Topic: v2.data_profiles.5210c7d-cf6b-4204-95d2-95d84ecbe382.POST

Message:

{
  "event": "CREATE_DATA_PROFILE",
  "actor": {
    "user_name": "john@acme.com"
  },
  "published": "2015-05-10T15:04:15Z",
  "url": { /* the href of the created data profile */ },
  "status_code": { /* the HTTP status code; 201 if the data profile was created successfully */ },
  "state": { /* the data profile object, equivalent to the one returned by GET /v2/data_profiles/{profile_id} */ },
  "details": {
    "catalog_id": "f3c59258-abdd-4e24-828b-0495ec519339",
    "dataset_id": "e522db21-59e8-44ab-81b2-bb40c3030a6f",
    "profile_id": "5165d439-96f0-40d4-90b2-93795deab61b",
    "is_governed": false /* true if the catalog is governed, otherwise false */
  }
}
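The binding key follows AMQP topic-exchange rules, where "*" matches exactly one dot-separated word. Purely as an illustration (this function is not part of the API), the matching can be mimicked in a few lines of Python:

```python
def topic_matches(binding_key, topic):
    """Mimic AMQP topic-exchange matching for '*' wildcards.

    '*' matches exactly one dot-separated word; a real topic exchange
    also supports '#' (zero or more words), which is omitted here.
    """
    pattern, words = binding_key.split("."), topic.split(".")
    if len(pattern) != len(words):
        return False
    return all(p in ("*", w) for p, w in zip(pattern, words))

# The binding key above matches the example topic:
matched = topic_matches(
    "v2.data_profiles.*.POST",
    "v2.data_profiles.5210c7d-cf6b-4204-95d2-95d84ecbe382.POST",
)
```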
POST /v2/data_profiles
Request

Custom Headers

  • Identity Access Management (IAM) bearer token.

Query Parameters

  • Whether to start the profiling service immediately after the data profile is created.

The data profile to create.

Response

A data profile holds the data profiling controls, options and results of a data set.

Status Code

  • Success.

  • Accepted.

  • You are not authorized to create a data profile.

  • You are not permitted to perform this action.

  • The data profile cannot be found.

  • An error occurred. The data profile cannot be created.

No Sample Response

This method does not specify any sample responses.
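Based on the payload requirements described above, a minimal create request might be assembled as follows. This sketch only builds the URL and body; the property names inside 'metadata' and the query parameter name 'start' are assumptions taken from the prose, so verify them against the request schema.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.dataplatform.cloud.ibm.com"

def build_create_profile_request(dataset_id, catalog_id, start=True):
    """Assemble URL and body for POST /v2/data_profiles.

    The 'metadata' section carries the data set id and the catalog (or
    project) id, as required above. The property names 'dataset_id' and
    'catalog_id', and the query parameter name 'start' (whether to run
    profiling immediately), are assumptions -- check the schema.
    """
    url = f"{BASE_URL}/v2/data_profiles?{urlencode({'start': str(start).lower()})}"
    body = json.dumps({
        "metadata": {
            "dataset_id": dataset_id,
            "catalog_id": catalog_id,
        }
        # an optional "entity" section could carry the profiling options
    })
    return url, body

# Ids reused from the example message above
url, body = build_create_profile_request(
    "e522db21-59e8-44ab-81b2-bb40c3030a6f",
    "f3c59258-abdd-4e24-828b-0495ec519339",
)
```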

Modify asset level classification

Status: Complete

Modifies asset level classification detail in the data_profile attribute in the specified project or catalog, provided the caller has the necessary rights to do so. This API is used for CRUD operations on asset level classification.

The patch request for classification should contain the classification details that are to be added to the data_profile attribute.

The updates must be specified by using the JSON patch format, described in RFC 6902.

PATCH /v2/data_profiles/classification
Request

Custom Headers