Introduction
You can use a collection of Watson Data REST APIs associated with Watson Studio and Watson Knowledge Catalog to manage data-related assets and connections in analytics projects and catalogs on IBM Cloud Pak for Data.
- Catalog data: Use the catalog and asset APIs to create catalogs to administer your assets, associate properties with those assets, and organize the users who use the assets. Assets can be notebooks or connections to files, database sources, or data assets from a connection.
- Govern data: Use the governance and workflows APIs to implement data policies and a business glossary that fit your organization, to control user access rights to assets, and to uncover data quality and data lineage.
- Add and find data: Use the discovery, search, and connections APIs to add and find data within your projects and catalogs.
API Endpoint
https://{cpd_cluster_host}
Creating a CPD bearer token
A bearer token from IBM Cloud Pak for Data is required to use any of the Watson Data APIs.
Visit the authorization section on Cloud Pak for Data for more information.
Use the value of the token property in the response to the curl command below as your access token. Set the access token as the Authorization header parameter for requests to the Watson Data APIs. The format is Authorization: Bearer <access_token_value_here>. For example:
Authorization: Bearer eyJraWQiOiIyMDE3MDgwOS0wMDowMDowMCIsImFsZyI6IlJTMjU2In0...
Curl command with username and password to retrieve token
curl -k -X POST https://cpd_cluster_host/icp4d-api/v1/authorize -H 'cache-control: no-cache' -H 'content-type: application/json' -d '{"username":"admin","password":"password"}'
Response
{
"_messageCode_": "200",
"message": "Success",
"token": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VybmFtZSI6ImFkbWluIiwicm9sZSI6IkFkbWluIiwicGVybWlzc2lvbnMiOlsiYWNjZXNzX2FkdmFuY2VkX2dvdmVybmFuY2VfY2FwYWJpbGl0aWVzIiwic2lnbl9pbl9vbmx5IiwiYWNjZXNzX2NhdGFsb2ciLCJhY2Nlc3NfaW5mb3JtYXRpb25fYXNzZXRzIiwiYWRtaW5pc3RyYXRvciIsIm1hbmFnZV9xdWFsaXR5IiwiY2FuX3Byb3Zpc2lvbiIsIm1hbmFnZV9kaXNjb3ZlcnkiLCJtYW5hZ2VfbWV0YWRhdGFfaW1wb3J0IiwidmlydHVhbGl6ZV90cmFuc2Zvcm0iLCJtYW5hZ2VfY2F0YWxvZyIsImF1dGhvcl9nb3Zlcm5hbmNlX2FydGlmYWN0cyIsIm1hbmFnZV9jYXRlZ29yaWVzIiwibWFuYWdlX2dvdmVybmFuY2Vfd29ya2Zsb3ciLCJtYW5hZ2VfaW5mb3JtYXRpb25fYXNzZXRzIiwidmlld19xdWFsaXR5Iiwidmlld19nb3Zlcm5hbmNlX2FydGlmYWN0cyJdLCJzdWIiOiJhZG1pbiIsImlzcyI6IktOT1hTU08iLCJhdWQiOiJEU1giLCJ1aWQiOiIxMDAwMzMwOTk5IiwiYXV0aGVudGljYXRvciI6ImRlZmF1bHQiLCJpYXQiOjE2MDQwMjI0NTQsImV4cCI6MTYwNDA2NTYxOH0.xtBNYtShcFO51Ja1AEN_hkGwEcq00bshXwZ_rTTGzJu1BHzATpE6JlI5ssBky8ojoJH1PH_hvceCO2UBDwvv2bWIk3efvznADr0FjM_GZSMDd-sxoNkLoUGucdAxEQs80jVAN7OPnPrqDbRqE4D191TRubmtb22ys3H7adgTWrX0dLOQ-sW4zHa-rOEHi7yKvQyl-Jqs7IYpPXlNMnmezoMGfAzrtx9pAGABSVnFQItA3TVf-64jDtMWoDXyHllU3XsYMQKaRSAQnzSYX4WaQ5U6utsT1uTpQ6ViPId7LhYJz_lbwzo8I6CDRlfYRDsXTczQBTHG9vLhC25w6OexvA"
}
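The same token request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the helper names build_authorize_request and bearer_header and the host cpd.example.com are illustrative assumptions. The code only builds the request and the header; sending the request (for example with urllib.request.urlopen) is left to the caller, and a cluster with a self-signed certificate may need extra TLS configuration, as the -k flag in the curl command suggests.

```python
import json
import urllib.request


def build_authorize_request(host, username, password):
    """Build (but do not send) the POST request used to obtain a token."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{host}/icp4d-api/v1/authorize",
        data=body,
        headers={"cache-control": "no-cache", "content-type": "application/json"},
        method="POST",
    )


def bearer_header(authorize_response_text):
    """Turn the JSON text returned by the authorize call into a header dict."""
    token = json.loads(authorize_response_text)["token"]
    return {"Authorization": f"Bearer {token}"}


# Example with a shortened, made-up token value (hypothetical host name).
sample_response = '{"_messageCode_": "200", "message": "Success", "token": "eyJhbGciOi..."}'
request = build_authorize_request("cpd.example.com", "admin", "password")
headers = bearer_header(sample_response)
```

The resulting headers dictionary can be passed with every later request to the Watson Data APIs.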
The bss_account_id parameter
Some APIs make use of the bss_account_id parameter, also referred to as the tenant ID.
Even though the system is designed to support multiple tenants, there is currently one and only one account (tenant) in Cloud Pak for Data, with a static value of 999. Always use 999 if a bss_account_id parameter needs to be specified.
Versioning
Watson Data API has a major, minor, and patch version, following the industry convention of semantic versioning. In the version number format MAJOR.MINOR.PATCH, the MAJOR version is incremented when incompatible API changes are made, the MINOR version is incremented when functionality is added in a backwards-compatible manner, and the PATCH version is incremented when backwards-compatible bug fixes are made. The service's major version is represented in the URL path.
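As an illustration of the MAJOR.MINOR.PATCH rule, the following minimal sketch compares two version strings. The function names and the version numbers in the usage note are hypothetical; the Watson Data API does not provide these helpers.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_backwards_compatible(client_version, service_version):
    """True when the service has the same MAJOR version and is not older."""
    client = parse_version(client_version)
    service = parse_version(service_version)
    # Tuples compare element by element, so this orders MINOR before PATCH.
    return client[0] == service[0] and service >= client
```

Under these assumptions, a client written against 2.4.1 would treat a service at 2.5.0 as compatible, but not one at 3.0.0 (incompatible changes) or 2.3.9 (older than the client expects).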
Error Handling
Responses with 400-series or 500-series status codes are returned when a request cannot be completed. The body of these responses follows the error model, which contains a code field to identify the problem and a message field to explain how to solve the problem. Each individual endpoint has specific error messages. All responses with 500 or 503 status codes are logged and treated as a critical failure requiring an emergency fix.
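A client can turn such a response into a short diagnostic line. The sketch below assumes that the error body is a flat JSON object with code and message fields; the exact layout of the error model is not shown in this section, so treat that shape, and the function name summarize_error, as assumptions.

```python
import json


def summarize_error(status_code, body_text):
    """Return a one-line summary for a 4xx/5xx response, or None otherwise."""
    if status_code < 400:
        return None
    try:
        body = json.loads(body_text)
    except ValueError:
        # The body was not valid JSON; fall back to placeholders.
        body = {}
    if not isinstance(body, dict):
        body = {}
    code = body.get("code", "unknown")
    message = body.get("message", "no message provided")
    kind = "server" if status_code >= 500 else "client"
    return f"{status_code} ({kind} error) {code}: {message}"
```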
Catalogs
Watson Knowledge Catalog helps you easily organize, find and share data assets, analytical assets, etc. for many data science projects and for the users who need to use those assets.
You can use the Catalog API to create catalogs which are rich metadata repositories for organizing and exploring metadata.
There are two phrases that will be used repeatedly throughout this (and the "Assets" and "Asset Types") documentation:
- asset resource: The primary content of the asset. Many assets have a resource that is stored in an external repository: a data file, connected data set, notebook file, dashboard definition, or model definition.
- asset metadata: The information about the asset resource. Each asset has a primary metadata document in a project or catalog and might have additional metadata documents.
See the Asset Terminology section for more information about those two phrases.
This section describes some of the individual Catalog APIs.
Get a Catalog
You can get metadata about a catalog using the get Catalog API. (Note: you aren't retrieving the actual data catalog with the GET Catalog API - you're just retrieving metadata that describes the catalog.)
Get Catalog - Request URL:
GET {service_URL}/v2/catalogs/{catalog_id}
Get Catalog - Response Body:
{
"metadata": {
"guid": "c6f3cbd8-2b7f-42fb-aa60-___",
"url": "https://api.dataplatform.cloud.ibm.com/v2/catalogs/c6f3cbd8-2b7f-42fb-aa60-___",
"creator_id": "IBMid-___",
"create_time": "2018-11-06T17:40:32Z"
},
"entity": {
"name": "CatalogForGettingStartedDoc",
"description": "Catalog created for Getting Started doc",
"generator": "Your catalog generator",
"bss_account_id": "12345___",
"capacity_limit": 0,
"is_governed": false,
"saml_instance_name": "IBM w3id"
}
}
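The request URL and the response above can be handled with a few lines of Python. This is a minimal sketch: the helper names catalog_url and summarize_catalog and the abbreviated sample body are illustrative assumptions, and the choice of fields to extract is arbitrary.

```python
import json


def catalog_url(service_url, catalog_id):
    """Build the Get Catalog request URL from its two parts."""
    return f"{service_url.rstrip('/')}/v2/catalogs/{catalog_id}"


def summarize_catalog(response_text):
    """Pick a few descriptive fields out of a Get Catalog response body."""
    doc = json.loads(response_text)
    return {
        "guid": doc["metadata"]["guid"],
        "name": doc["entity"]["name"],
        "is_governed": doc["entity"]["is_governed"],
    }


# Abbreviated copy of the response body shown above.
sample = """
{
  "metadata": {"guid": "c6f3cbd8-2b7f-42fb-aa60-___", "create_time": "2018-11-06T17:40:32Z"},
  "entity": {"name": "CatalogForGettingStartedDoc", "is_governed": false}
}
"""
summary = summarize_catalog(sample)
```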
Get Catalogs
To obtain the metadata for all the catalogs that you have access to (ie, are a collaborator of), you can call the GET Catalogs API.
Get Catalogs - Request URL:
GET {service_URL}/v2/catalogs
Note: the above URL is the simplest URL for getting catalogs because it doesn't contain any parameters. There are a number of optional parameters (limit, bookmark, skip, include, bss_account_id) to the above URL that you can make use of to limit the number of catalogs for which metadata is returned.
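One way to add the optional parameters is to append only those that have a value. The following sketch uses urllib.parse.urlencode from the standard library; the function name catalogs_url and the host are hypothetical, and the parameter values are examples only.

```python
from urllib.parse import urlencode


def catalogs_url(service_url, **params):
    """Build the Get Catalogs URL, appending only the parameters that are set."""
    query = {name: value for name, value in params.items() if value is not None}
    base = f"{service_url.rstrip('/')}/v2/catalogs"
    return f"{base}?{urlencode(query)}" if query else base
```

For example, catalogs_url("https://cpd.example.com", limit=10, bss_account_id="999") yields a URL with both parameters in the query string, while calling it with no keyword arguments yields the plain URL shown above.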
Get Catalogs - Response Body:
{
"catalogs": [
{
"metadata": {
"guid": "c6f3cbd8-2b7f-42fb-aa60-___",
"creator_id": "IBMid-___",
"create_time": "2018-11-06T17:40:32Z"
},
"entity": {
"name": "CatalogForGettingStartedDoc",
"description": "Catalog created for Getting Started doc",
"generator": "Your catalog generator",
"bss_account_id": "12345___",
"capacity_limit": 0,
"is_governed": false,
"saml_instance_name": "IBM w3id"
}
}
],
"nextBookmark": "g1AAAAFCeJzLYWBgYMlgTmHQSklKzi9KdUhJMjT___",
"nextSkip": 0
}
In the above example, metadata for only one catalog is returned - the catalog created above. An advantage of calling the GET Catalogs API is you don't have to remember the ID of any particular catalog in order to get the metadata for that catalog.
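To look up a catalog ID by name, the parsed Get Catalogs response can be turned into a dictionary. This is a minimal sketch with a hypothetical helper name; it assumes the response shape shown above, and if two catalogs share a name the later one wins.

```python
def catalog_ids_by_name(get_catalogs_body):
    """Map each catalog name in a parsed Get Catalogs response to its GUID."""
    return {
        item["entity"]["name"]: item["metadata"]["guid"]
        for item in get_catalogs_body.get("catalogs", [])
    }
```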
Assets
From a high level, an asset is an item of data or data analysis in a project or catalog. Most of these assets consist of two parts:
- Asset resource: The primary content of the asset. Many assets have a resource that is stored in an external repository: a data file (eg. text file, image, video, etc.), connected data set (eg. database table), notebook file, dashboard definition, or model definition. The Assets API does not affect this part of the asset. Think of this as the object that's being described by asset metadata (ie, an asset resource is a "describee").
- Asset metadata: The information about the asset resource. Each asset has a primary metadata document in a project or catalog and might have additional metadata documents. This is the part of the asset that you can get, create, or operate on with the Assets API. Think of this as the object that's doing the describing of an asset resource (ie, asset metadata is a "describer").
A library is a useful analogy for understanding the scope of the Assets API. A library contains a set of books and an index. The index, or card catalog, contains a card about each book. A card has information about the book, including the location of the book. A project or catalog contains only the card catalog part of the library. The books, or asset resources, are elsewhere. Consequently, the Assets API can return the location of an asset resource, but not affect the asset resource in any way.
The term asset encapsulates the following:
- [1] asset resource: the primary / initial resource that a user wants described by a primary metadata document.
- [2] primary metadata document: a document added to a catalog to describe an asset resource.
- [3] attributes: chunks of data inside a primary metadata document that describe either the asset resource or a secondary / extended metadata document.
- [4] secondary / extended metadata documents: additional documents containing information related to the asset resource. Attached to the primary metadata document. Can be generated by catalog processes, such as profiling.
- [5] a combination of all of the above: the Watson Knowledge Catalog UI presents information from each of the above on a single page and calls all that information an "asset".
For example, when you call the Get Assets API, you receive asset metadata (in a primary metadata document). The asset metadata might point to the location of the asset resource, but the Get Assets API does not return the asset resource. Similarly, when you run the Create Assets API, you create a primary metadata document that can, eventually, include the location of an existing asset resource.
This overview section provides a picture of the parts of a "primary metadata document" and then explains the parts of that picture. The picture provides a kind of "map" of a primary metadata document, so it's recommended to spend a few minutes studying it. Readers who prefer API examples can skip over the explanation of that picture that follows, and go straight to the Assets API Examples section. However, the Assets API Examples section will often refer back to the terms and explanations discussed in this Assets API Overview section.
Note: when calling any of the endpoints in the Assets API you must specify either a catalog ID or a project ID to indicate whether the metadata for an asset is (to be) in a catalog or a project. Because the Assets API endpoints can be applied to either a catalog or a project, rather than repeating the phrase "either a catalog or a project" over and over throughout the rest of this documentation, only the term "catalog" will be used. The possibility of instead using a "project" will be implied.
Asset Primary Metadata Document (or Card)
A primary metadata document is a document that contains the primary metadata for an asset resource. Once a primary metadata document has been created and stored in the catalog, it's often informally said that that asset resource has been "cataloged", or "added to the catalog". Note: being cataloged, or added to the catalog, does not mean the asset resource has been moved or copied and is now physically stored inside the catalog - it just means a primary metadata document has been created for that asset resource, and that primary metadata document is now stored in the catalog.
Almost every Assets API endpoint revolves around creating, reading, modifying or deleting a primary metadata document. JSON is natively used to store primary metadata documents in a catalog, and to transfer those documents in Assets API REST calls. So, JSON examples of primary metadata documents will be used throughout this documentation.
In this documentation, the term card (as in, an index card in a library's catalog) will often be used as a short nickname for the phrase "primary metadata document". In this documentation, "card" and "primary metadata document" mean exactly the same thing. The term "card" just saves us from reading and writing the lengthier phrase "primary metadata document" over and over.
A primary metadata document (ie, card) is a JSON object that's composed of up to three top-level fields, named as follows:
1. [**"metadata"**](#Section_Assets__Overview_and_Terminology__Asset_Metadata_Document__metadata_group): a JSON object containing metadata _common to all_ [asset types](#Section_Assets__Overview_and_Terminology__Asset_Type)
2. [**"entity"**](#Section_Assets__Overview_and_Terminology__Asset_Metadata_Document__entity_group): a JSON object containing [attributes](#Section_Assets__Overview_and_Terminology__Attributes), each containing metadata _specific to one_ asset type
3. [**"attachments"**](#Section_Assets__Overview_and_Terminology__Asset_Metadata_Document__attachments_group): an optional JSON array, each item of which is a JSON object containing _metadata for_ an attached (ie, externally stored) [asset resource](#Section_Asset__Terminology_Overview__definition__Asset_Resource) or [extended metadata document](#Section_Assets__Overview_and_Terminology__Asset_Metadata_Document_Overview__attachment__extended_metadata)
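The three-field structure can be checked mechanically. The sketch below is based only on the description above: it treats "metadata" and "entity" as JSON objects, "attachments" as a JSON array, and reports any other top-level field. The helper name check_card_shape is hypothetical, and a real response may carry fields that this section does not mention.

```python
ALLOWED_TOP_LEVEL = {"metadata", "entity", "attachments"}


def check_card_shape(card):
    """Return a list of problems found in a card's top-level structure."""
    problems = [
        f"unexpected field: {name}"
        for name in sorted(set(card) - ALLOWED_TOP_LEVEL)
    ]
    for name in ("metadata", "entity"):
        if name in card and not isinstance(card[name], dict):
            problems.append(f"{name} should be a JSON object")
    if "attachments" in card and not isinstance(card["attachments"], list):
        problems.append("attachments should be a JSON array")
    return problems
```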
For a pictorial representation of a primary metadata document (ie, card) and its associated asset resource and extended metadata documents, see the Parts of a Primary Metadata Document figure below:
In particular, note that:
- red rectangles are used in the figure to highlight the [three top-level fields of a card](#Section_Assets__Overview_and_Terminology__Asset_Metadata_Document_Overview__three_top_level_fields_of_a_card).
- the green rectangles illustrate how important the _name_ of the [primary asset type](#Section_Assets__Overview_and_Terminology__Asset_Type__Primary_Asset_Type_definition) is in relating various parts of the card, and the attached [asset resource](#Section_Asset__Terminology_Overview__definition__Asset_Resource), to each other. In the example figure, the value of `"metadata.asset_type"` is "data\_asset". The value you'll see in your card depends on the "asset_type" you've specified for your asset.
"metadata" field of a Primary Metadata Document
The "metadata" field of a primary metadata document (ie, of a card) is a JSON object that contains metadata fields that are common across all types of assets. (See the top red rectangle in the parts figure.) The Assets API specifies the names of the fields that go into the "metadata" part of the card. The user must supply values for some of the fields in "metadata"; the values of other fields in "metadata" will be filled in by the Assets API during the life of the card. Here's a list of some of the fields inside "metadata" (see example cards in the Get Asset section for more extensive lists):
- "asset_id":
  - The ID of the card (ie, primary metadata document) rather than of the asset resource described by the card.
  - Created internally by the Assets API at the time the card is created. That is, you do not supply this value.
- "asset_type":
  - You must supply this value.
  - Declares the primary asset type of this card.
  - Describes the type of the asset resource attached (if any) to this card.
  - Specifies the name of the primary attribute in this card.
  - See Asset Types for more details on asset types.
- "asset_attributes":
  - You must not supply any value for this field when creating a primary metadata document. The Assets APIs maintain the contents of this field.
  - An array of attribute names (only the names, not the actual attributes).
  - Each attribute / asset type name listed in this array will have a correspondingly named attribute in the "entity" field of the card.
  - The name of each attribute must match the name of an existing asset type, so this is also an array of the names of the primary and secondary / extended asset types used by this card.
- "name": the name of the asset resource this card describes
- "description": a description of the asset resource
- "origin_country": the originating country for the asset resource
- "tags": an array of terms that users want to associate with the asset resource
- "rov": Rules Of Visibility.
  - "mode": -1 - this is the default, which corresponds to "mode": 0, public (see below)
  - "mode": 0 - indicates public visibility, in which everybody can view and search the values of the asset's primary metadata document (card), and preview the asset's data. Note: access can still be denied based on actionable governance policy rules.
"rov": {
"mode": 0,
"collaborator_ids": []
}
  - "mode": 8 - indicates private visibility, which allows users listed as members of the asset (as denoted by the collaborator_ids list) to view and search all fields (including metadata, entity, and attachments) of the asset's primary metadata document (card), and preview the asset's data. Non-members are allowed to only view and search the metadata field, and cannot preview the asset's data. Note: access can still be denied based on actionable governance policy rules.
"rov": {
"mode": 8,
"collaborator_ids": [
{
"IBMid-06___": {
"user_iam_id": "IBMid-06___"
}
},
{
"IBMid-27___": {
"user_iam_id": "IBMid-27___"
}
}
]
}
  - "mode": 16 - indicates hidden visibility, in which only users listed as members of the asset (as denoted by the collaborator_ids list) have any access to fields in the asset, and non-members have no access to the asset. Note: access can still be denied based on actionable governance policy rules.
"rov": {
"mode": 16,
"collaborator_ids": [
{
"IBMid-06___": {
"user_iam_id": "IBMid-06___"
}
},
{
"IBMid-27___": {
"user_iam_id": "IBMid-27___"
}
}
]
}
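The visibility rules above can be summarized in code. The following is a minimal sketch of how a client might interpret an "rov" object; it is not how the service enforces access, and it ignores governance policy rules, which can still deny access. The function names are hypothetical, and the code assumes that each entry in collaborator_ids is an object keyed by user ID, as in the examples above.

```python
VISIBILITY = {-1: "public", 0: "public", 8: "private", 16: "hidden"}


def collaborator_ids(rov):
    """Collect member IDs from a list of objects or from a single object."""
    raw = rov.get("collaborator_ids") or []
    entries = [raw] if isinstance(raw, dict) else raw
    ids = set()
    for entry in entries:
        ids.update(entry.keys())
    return ids


def can_view_entity(rov, user_id):
    """True if user_id may see 'entity' and 'attachments' (modes not listed raise KeyError)."""
    visibility = VISIBILITY[rov.get("mode", -1)]
    return visibility == "public" or user_id in collaborator_ids(rov)


def can_view_metadata(rov, user_id):
    """Non-members lose access to 'metadata' only when the asset is hidden."""
    visibility = VISIBILITY[rov.get("mode", -1)]
    return visibility != "hidden" or user_id in collaborator_ids(rov)
```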
"entity" field of a Primary Metadata Document
The "entity" field of a card (ie, primary metadata document) is a JSON object that contains additional JSON objects called attributes, each of which contains metadata fields that are specific to one asset type. (See the middle red rectangle in the parts figure.) The only contents of the "entity" field are attributes, which are discussed in the next section.
Note: the fact that the "entity" section contains attributes for more than one asset type does not mean that a single card contains metadata for more than one asset resource. A card always contains metadata for exactly one asset resource, and that asset resource will have exactly one attribute associated with it (see primary attribute below). All the other attributes in the "entity" field contain extended metadata describing the single asset resource that the card was created for. Really, asset types ought to be thought of as attribute types because asset types literally define (some of) the fields that will appear in attributes.
Attributes
An attribute:
- is contained directly inside the ["entity"](#Section_Assets__Overview_and_Terminology__Asset_Metadata_Document__entity_group) field of the [primary metadata document](#Section_Assets__Overview_and_Terminology__definition__primary_metadata_document).
- is identically named with, and has fields that are partially defined by, an [Asset Type](#Section_Assets__Overview_and_Terminology__Asset_Type)
- describes an [asset resource](#Section_Asset__Terminology_Overview__definition__Asset_Resource) or something related to that asset resource, such as an [extended metadata document](#Section_Assets__Overview_and_Terminology__Asset_Metadata_Document_Overview__attachment__extended_metadata)
There is one attribute in the "entity" field for each attribute name that appears in the "metadata.asset_attributes" array. So, for example, if the "metadata.asset_attributes" array contains these two attribute names:
"metadata": {
...
"asset_attributes": [
"data_asset",
"data_profile"
],
}
then the "entity" field will contain these two correspondingly named attributes:
"entity": {
"data_asset": { // attribute name matches "data_asset" in "metadata.asset_attributes"
...attribute contents...
},
"data_profile": { // attribute name matches "data_profile" in "metadata.asset_attributes"
...attribute contents...
}
}
The name of each attribute in "entity" must also match the name of an existing asset type. That is, an attribute named "X" will contain metadata related to an asset type also named "X". So, an attribute's name can be thought of as simultaneously telling us that attribute's "type". For example, in this asset metadata document example, both the attribute names "data_asset" and "data_profile" refer to asset types with those same names.
There is one special attribute that will be referred to as the primary attribute. The primary attribute is the main attribute used to describe an asset resource. Every primary metadata document will have exactly one primary attribute. The name of the primary attribute is the same as the name that appears in the "metadata.asset_type" field.
Any attribute other than the primary attribute is a "secondary" / "extended" attribute whose name must match the name of a secondary / extended asset type. A common example of an attribute for extended metadata is named "data_profile", which is created by the Profiling API. For example, see the underlined names in the Parts of a Primary Metadata Document figure, or the "entity.data_profile" field in this asset metadata document.
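The naming rules for attributes can be expressed as two small helpers. This is a minimal sketch that operates on a card already parsed into a Python dictionary; the function names are hypothetical, and the "mime_type" field in the usage note is a made-up placeholder for attribute contents.

```python
def split_attributes(card):
    """Return (primary_attribute, extended_attributes) for a card."""
    primary_name = card["metadata"]["asset_type"]
    entity = card.get("entity", {})
    primary = entity[primary_name]
    extended = {
        name: value for name, value in entity.items() if name != primary_name
    }
    return primary, extended


def attribute_name_mismatches(card):
    """Names present in only one of metadata.asset_attributes and entity."""
    declared = set(card["metadata"].get("asset_attributes", []))
    present = set(card.get("entity", {}))
    return sorted(declared ^ present)
```

For a card whose "metadata.asset_type" is "data_asset" and whose "entity" holds "data_asset": {"mime_type": "text/csv"} and "data_profile": {}, split_attributes returns the "data_asset" object as the primary attribute and a dictionary containing only "data_profile" as the extended attributes.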
Although the Assets API restricts the names of attribute objects to match the names of asset types, the Assets API does not (in general) specify what the contents of those attributes should be. So, in some sense, the fields within an attribute are the opposite of the fields within the "metadata" field:
- the Assets API "owns" (or, specifies) which fields go inside "metadata"
- the user "owns" (or, specifies) which fields go inside the attributes (except for some fields of already available asset types)
The following example shows two attributes, whose names must match asset types, but whose contents are (for the most part) up to the user:
"entity": {
"data_asset": { // attribute name must match some asset type's name
...
data_asset *type creator* and
data_asset *attribute creator*
decide what fields go here
...
},
"data_profile": { // attribute name must match some asset type's name
...
data_profile *type creator* and
data_profile *attribute creator*
decide what fields go here
...
}
}
Because the Asset Types API is itself the creator of some already available asset types, the Asset Types API specifies some of the fields for any attribute whose name corresponds to one of those already available asset types. For example, see the discussion of the already available asset type called "data_asset".
Note: there is a GET attribute API that can be used to retrieve just the attributes in the "entity" section of the primary metadata document, instead of the entire primary metadata document as returned by the GET asset API.
"attachments" (optional) field of a Primary Metadata Document
The "attachments" field of a card (ie, primary metadata document) is a JSON array, each item of which contains metadata for one attachment. (See the bottom red rectangle in the parts figure.)
Several related objects are involved when working with attachments:
- the "attachments" array in the primary metadata document
- an attachment item in the "attachments" array
- a metadata document that will be returned from a call to the GET Attachment API. That metadata document will contain information that points to, and can be used to retrieve, either:
  - the asset resource being described by the primary metadata document, or
  - an extended metadata document containing extended metadata for the asset resource
Each attribute in the "entity" field can have a corresponding attachment item in the "attachments" array. An attribute and its corresponding attachment item are related to each other by using the name of the attribute as the value for the attachment item's "asset_type" field. For example, notice in the following card snippet how the attribute name "data_asset" is used to link that "data_asset" attribute to its attachment item in the "attachments" array:
"entity": {
...other attributes
"data_asset": { // <-- attribute's name matches its...
...
},
...other attributes
},
"attachments": [
...other attachment items
{
...
"asset_type": "data_asset", // <-- ...attachment's asset_type
...
"connection_id": "...", // connection_ fields are one way
"connection_path": "...", // that item points to attached object
...
},
...other attachment items
]
Notice also in the above card snippet that, in this case, the attachment item contains two "connection_..." fields that point to the attachment object located in external storage. So, an attribute has an attachment item which points to an attachment object.
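The link from an attribute to its attachment item can be followed with a simple filter on "asset_type". The following is a minimal sketch with a hypothetical helper name; it works on a card already parsed into a Python dictionary and returns a list, since nothing in this section rules out more than one matching item.

```python
def attachments_for_attribute(card, attribute_name):
    """Return the attachment items whose asset_type equals the attribute name."""
    return [
        item
        for item in card.get("attachments", [])
        if item.get("asset_type") == attribute_name
    ]
```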
Like the fields of "metadata", the fields of an attachment item are specified by the Assets API. Some of the most important fields in an attachment item are:
- "asset_type":
  - describes the type of the attachment
  - figuratively connects the attachment item to the attribute with the same name
- "connection_id" and "connection_path" (optional):
  - this pair of fields specifies the ID of a WDP Connection and a path in the associated data repository that points to the attached object
  - always used for an attached asset like a database table
  - can also be used for an attached asset resource (eg, spreadsheet) that can be stored in the catalog
  - the presence of these two fields means the attachment will be known as a remote attachment
- "object_key" and "handle" (optional)
For any attachment, only one of the following two pairs of fields will be used: "connection_id" and "connection_path" (ie, remote attachment), or "object_key" and "handle" (ie, referenced attachment).
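The distinction between remote and referenced attachments depends only on which pair of fields is present, so it can be sketched as a small classifier. The function name attachment_kind and the "unknown" fallback are assumptions made for this illustration.

```python
def attachment_kind(item):
    """Classify an attachment item as 'remote', 'referenced', or 'unknown'."""
    has_connection = "connection_id" in item and "connection_path" in item
    has_reference = "object_key" in item and "handle" in item
    if has_connection and not has_reference:
        return "remote"
    if has_reference and not has_connection:
        return "referenced"
    # Neither pair, or both pairs, is present.
    return "unknown"
```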
Interestingly, being remote does not tell you whether or not an attachment is in the catalog. Remote only tells you how the attached object can be retrieved: by using a connection.
An attachment item (in the card) points to one of two kinds of attached object (in external storage):
- an asset resource, or
- an extended metadata document.
Those are briefly discussed in the next 2 sections.
Asset Resource Attachment
The most typical attachment object is the asset resource being described by the card.
Follow the green arrows in the Parts of a Primary Metadata Document figure to see how:
- the asset's type name leads to
- an attribute name, which leads to
- a primary attribute, which leads to
- an attachment metadata item for that attribute, which finally leads to
- the attached asset resource.
For a full example that shows an attachment metadata item for an attached csv file, see the (only) item in the "attachments" array in Get Asset - CSV File - Response Body - Before Profiling.
Extended Metadata Document Attachment(s)
The other kind of attachment objects are extended metadata documents. A card can have 0, 1, or many attached extended metadata documents. These documents each contain a related set of (additional) metadata describing the asset resource.
See the underlined "data_profile" type name in the Parts of a Primary Metadata Document figure for a visualization of how, for one extended metadata document, the three parts ("metadata", "entity", "attachments") of a card are related to each other.
See the second item in the "attachments" array in Get Asset - CSV File - Response Body - After Profiling for an example showing an attachment item for a "data_profile" extended metadata document.
Uses of "asset_type" value
From the previous sections, you can see that the "asset_type" value shows up in:
- the "metadata.asset_type" field
- the "metadata.asset_attributes" array
- a field (ie, object) in the "entity" field. This object is the primary attribute.
- the asset_type field of the primary attribute's attachment (if such an attachment exists, which it typically does). This (primary) attachment will be the asset resource (eg, database table, spreadsheet, csv file, etc.).
For example, see the Parts of a Primary Metadata Document figure above, where the name of the primary attribute is, in this case, "data_asset" and is highlighted with green rectangles in all the places it's used. The path shown by the green arrows in the figure starts at the "metadata.asset_type" field and ends at the asset resource, in this case a file called Sample.csv.
Other Assets API Objects
Finally, here is a brief list of some of the remaining objects that can be manipulated with the Assets APIs:
- owner: the owner of the asset
- collaborators: users who are allowed to see and possibly edit (some parts of) the asset
- perms: permissions for viewing / editing an asset
- ratings: indications of how popular or useful the asset is
- stats: statistics on how often and when the asset was viewed or edited, and who did that viewing or editing
Getting an Asset
It's important to understand that the GET Asset API does not return an asset resource like a database table, a spreadsheet, a csv file, etc. Instead, it returns a primary metadata document (ie, card) that describes an asset resource.
Obviously, a primary metadata document (ie, card) must have been created before it can be retrieved. Still, it's instructive to see actual examples of a card and its parts before attempting to create those things. After all, many users will retrieve cards that were previously created by someone else.
This and the following sections show how to retrieve asset metadata and attachments (eg, an asset resource and extended metadata documents).
Getting an Asset - for a Connection
We'll start by retrieving a common primary metadata document (ie, card): one for a "connection" asset type. This is a simple card because it has no attachments. That makes it an easy example to start with, even though many of the other cards you'll encounter do have attachments.
Use the following GET Asset API to retrieve the primary metadata document for a connection. Note that this requires that you know and supply the IDs of both the primary metadata document (ie, card) and of the catalog that contains the card. Either someone has given you both of those IDs or you can browse to the asset's page using the Watson Knowledge Catalog UI and then extract both the catalog ID and the primary metadata document ID from within the URL in the browser's address bar.
Getting an Asset - Request URL:
GET {service_URL}/v2/assets/{asset_id}?catalog_id={catalog_id}
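The request URL can be assembled from the service URL and the two IDs. The following is a minimal sketch; the function name asset_url, the host, and the shortened IDs in the usage note are hypothetical.

```python
from urllib.parse import urlencode


def asset_url(service_url, asset_id, catalog_id):
    """Build the Get Asset request URL for a card stored in a catalog."""
    query = urlencode({"catalog_id": catalog_id})
    return f"{service_url.rstrip('/')}/v2/assets/{asset_id}?{query}"
```

For example, asset_url("https://cpd.example.com", "070e9be2", "c6f3cbd8") produces https://cpd.example.com/v2/assets/070e9be2?catalog_id=c6f3cbd8.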
The following is the primary metadata document (ie, card) that's returned.
Note: you may find it helpful to look at the Parts of a Primary Metadata Document Figure before looking at the following Response Body.
Getting an Asset - Connection - Response Body:
{
"metadata": {
"rov": {
"mode": 0,
"collaborator_ids": {}
},
"usage": {
"last_updated_at": "2018-11-06T17:40:37Z",
"last_updater_id": "IBMid-___",
"last_update_time": 1541526037227,
"last_accessed_at": "2018-11-06T17:40:37Z",
"last_access_time": 1541526037227,
"last_accessor_id": "IBMid-___",
"access_count": 0
},
"name": "ConnectionForCSVFile",
"description": "Connection for CSV file",
"tags": [],
"asset_type": "connection",
"origin_country": "us",
"rating": 0,
"total_ratings": 0,
"catalog_id": "c6f3cbd8-2b7f-42fb-aa60-___",
"created": 1541526037227,
"created_at": "2018-11-06T17:40:37Z",
"owner_id": "IBMid-___",
"size": 0,
"version": 2,
"asset_state": "available",
"asset_attributes": [
"connection"
],
"asset_id": "070e9be2-40a8-4e0e-___",
"asset_category": "SYSTEM"
},
"entity": {
"connection": {
"datasource_type": "193a97c1-4475-4a19-b90c-295c4fdc6517",
"context": "source,target",
"properties": {
"bucket": "catalogforgettingsta___",
"secret_key": "{wdpaes}12345___=",
"api_key": "{wdpaes}eo/12345_=",
"resource_instance_id": "crn:v1:bluemix:public:cloud-object-storage:global:a/12345c___:7240b198-b0f6-___::",
"access_key": "12345___",
"region": "us-geo",
"url": "https://s3.us-south.objectstorage.softlayer.net"
},
"flags": []
}
}
}
The above response has two of the three primary groups of metadata that were described in the Primary Metadata Document section: "metadata" and "entity".
As discussed in Assets API Overview section, the contents of the "metadata" field are common to all primary metadata documents (ie, cards). The set of fields in "metadata" is completely defined by the Assets API. The values for some of those fields must be provided by the creator of the card, while other fields' values will be populated by various Assets APIs during the life of the card. Note the following fields' values in particular:
- "metadata" fields whose values are provided by the creator of the card:
  - "name": "ConnectionForCSVFile"
  - "description": "Connection for CSV file"
  - "asset_type": "connection"
  - "asset_attributes": [ "connection" ]
- "metadata" fields whose values are set by various Assets APIs during the life of the card:
  - "usage": contains various statistics describing usage of the card/asset
  - "catalog_id": the ID of the catalog that contains the card
  - "created_at": the time and date at which the card was created
  - "asset_id": the ID of the card (not the asset resource)
For more info about the "metadata" fields, see the discussion on "metadata" in the Assets API Overview section above.
The contents of the "entity" field are only partially defined by the Assets API. In particular, the "entity" field shown in the above card contains a field whose name must match the value in "metadata.asset_type", in this case, "connection". That field is the primary attribute.
On the other hand, both the names and the values of all the fields inside the primary attribute "entity.connection" are completely determined by the creator of the "connection" asset type and the creator of the "connection" attribute. The Assets API does not, in general, decide what fields go inside the primary attribute (or any other attribute). In the example "connection" attribute above, some of the more interesting fields are:
- "datasource_type": specifies the ID of the type of the data source to which a connection will be formed.
- "properties": specifies connection metadata specific to the type of the datasource. The exact contents of this field will change according to the type of the datasource.
For more info on the contents of "entity" in general, see the discussion on "entity" in the Assets API Overview section.
Notice the above card contains no "attachments" array. That means there is no attached asset resource associated with this card. A natural question is: how can "connection" asset metadata exist for, or describe, a non-existent "connection" asset resource? Actually, a "connection" asset resource does exist, but only when the metadata in the connection's primary metadata document is used to create a client-server connection at runtime.
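The GET Asset request used in this section can be sketched in Python as follows. This is a minimal illustration that uses only the standard library; the function name build_get_asset_request and the host, IDs, and token in the example call are placeholders, not values from a real cluster. The request is built but not sent.

```python
import urllib.parse
import urllib.request


def build_get_asset_request(service_url, asset_id, catalog_id, access_token):
    """Build (but do not send) GET {service_URL}/v2/assets/{asset_id}?catalog_id=..."""
    query = urllib.parse.urlencode({"catalog_id": catalog_id})
    path = "/v2/assets/" + urllib.parse.quote(asset_id, safe="")
    url = service_url.rstrip("/") + path + "?" + query
    return urllib.request.Request(
        url,
        method="GET",
        # Bearer token format as described in the introduction.
        headers={"Authorization": f"Bearer {access_token}"},
    )


# Placeholder values for illustration only.
req = build_get_asset_request(
    "https://cpd.example.com", "asset-123", "catalog-456", "TOKEN"
)
print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen and decoding the body with json.load would yield a card like the one shown above. Clusters that use self-signed certificates (the reason the curl example in the introduction passes -k) may additionally need a custom SSL context.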
Getting an Asset - for a CSV File
This section shows a far more typical example in which the primary metadata document (ie, card) does have an attached asset resource - in this case, a csv file named Sample.csv. Here are the very simple contents of the Sample.csv file:
Sample.csv file contents
Name,Number
abc,123
def,456
Use the GET Asset API to retrieve the asset metadata for the Sample.csv asset resource. Note: the GET Asset API only returns a primary metadata document (ie, card) that describes the Sample.csv file - it does not return the actual Sample.csv file.
Getting an Asset - Request URL:
GET {service_URL}/v2/assets/{asset_id}?catalog_id={catalog_id}
It's instructive to show two different versions of the primary metadata document for the Sample.csv asset:
- Before profiling (which returns a small metadata document - without extended metadata)
- After profiling (which returns a much larger metadata document - with extended metadata)
Note: you may find it helpful to look at the Parts of a Primary Metadata Document Figure before looking at either of the following two Get Asset Response Bodies.
Here is the smaller primary metadata document that exists before the Profile API is invoked on the Sample.csv file.
Getting an Asset - CSV File - Response Body - Before Profiling:
{
"metadata": {
"name": "Sample.csv",
"description": "A simple csv file.",
"asset_type": "data_asset",
"rov": {
"mode": 0,
"collaborator_ids": {}
},
"usage": {
"last_updated_at": "2018-11-06T17:45:23Z",
"last_updater_id": "IBMid-___",
"last_update_time": 1541526323713,
"last_accessed_at": "2018-11-06T17:45:23Z",
"last_access_time": 1541526323713,
"last_accessor_id": "IBMid-___",
"access_count": 0
},
"origin_country": "united states",
"rating": 0,
"total_ratings": 0,
"catalog_id": "c6f3cbd8-2b7f-42fb-aa60-___",
"created": 1541526321437,
"created_at": "2018-11-06T17:45:21Z",
"owner_id": "IBMid-___",
"size": 0,
"version": 2,
"asset_state": "available",
"asset_attributes": [
"data_asset"
],
"asset_id": "45f4ab8c-37d5-45a1-8adf-___",
"asset_category": "USER"
},
"entity": {
"data_asset": {
"mime_type": "text/csv",
"dataset": false
}
},
"attachments": [
{
"id": "b8c7a390-e857-4c34-add8-___",
"version": 2,
"asset_type": "data_asset",
"name": "remote",
"description": "remote",
"connection_id": "070e9be2-40a8-4e0e-___",
"connection_path": "catalogforgettingsta-datacatalog-r1s___/data_asset/Sample_SyjEQUy6m.csv",
"create_time": 1541526323713,
"size": 0,
"is_remote": true,
"is_managed": false,
"is_referenced": false,
"is_object_key_read_only": false,
"is_user_provided_path_key": true,
"transfer_complete": true,
"is_partitioned": false,
"complete_time_ticks": 1541526323713,
"user_data": {},
"test_doc": 0,
"usage": {
"access_count": 0,
"last_accessor_id": "IBMid-___",
"last_access_time": 1541526323713
}
}
]
}
The above primary metadata document has all three primary groups of metadata ("metadata", "entity", and "attachments") that were described in the Assets API Overview section.
The contents of the "metadata" field are very similar to those shown above for the Connection card example. The most important difference is the value that the user specified as the "asset type" for the Sample.csv asset, namely "data_asset". That asset type name shows up in two places inside the "metadata" section of the primary metadata document:
- "metadata":
  - "asset_type": "data_asset"
  - "asset_attributes": [ "data_asset" ]
As discussed in the Attributes section, the fact that "metadata.asset_type" has the value "data_asset" means the "entity" field of the card must contain a primary attribute called "data_asset". The Asset Types API provides the predefined asset type "data_asset". That "data_asset" type definition declares that there are two mandatory fields in a "data_asset" attribute: "mime_type" and "dataset", as can be seen in the card above and repeated here:
- "entity":
  - "data_asset":
    - "mime_type": "text/csv"
      - specifies the mime type of the asset resource. Here, the mime type indicates that the asset resource is a text csv file.
    - "dataset": false
      - false because there is no "columns" field in this primary attribute.
      - Note: false does not mean there are no columns in the asset resource. Clearly, our Sample.csv file does have columns. The problem here is that no one has (yet) told the card that the asset resource has columns. Compare this "data_asset" attribute to the one shown in the next example Get Asset - CSV File - Response Body - After Profiling, where the value of "dataset" has been changed to true, and the primary attribute does have a "columns" field.
Unlike in the Connection card example above, the card for the Sample.csv file does have an "attachments" field. In this case, the "attachments" array has one item in it. That item contains metadata that points to the attached asset resource (ie, the Sample.csv file). Some of the more interesting fields in that attachment item are:
- "id": "b8c7a390-e857-4c34-add8-___"
  - identifies the metadata document that points to the attached asset resource
- "asset_type": "data_asset"
  - matches the name of the primary attribute in "entity", thereby linking the primary attribute to this attachment item and designating this item as the item that points to the asset resource.
- "connection_id": "070e9be2-40a8-4e0e-___"
  - identifies a connection primary metadata document (ie, card) which contains credentials and other info that can be used to connect to the external repository that contains the attached asset resource (ie, the "Sample.csv" file)
  - not coincidentally, the particular connection card referred to by "070e9be2-40a8-4e0e-___" is the exact same connection card shown above in Get Asset - Connection Primary Metadata Document
- "connection_path": "catalogforgettingsta-datacatalog-r1s___/data_asset/Sample_SyjEQUy6m.csv"
  - identifies the path in the external repository that contains the attached asset (ie, the "Sample.csv" file)
- "is_remote": true
  - as discussed in the "attachments" overview section, is_remote is true because "connection_id" and "connection_path" are being used to describe how to get the Sample.csv asset resource.
- "is_referenced": false (at most one of "is_referenced" and "is_remote" will be true)
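The relationships described above (an attachment item is linked to an attribute through "asset_type", and at most one of "is_remote" and "is_referenced" is true) can be checked with a short Python sketch. The helper name summarize_attachments and the "location" labels are hypothetical, chosen only for illustration; the card below is a trimmed copy of the response shown earlier.

```python
def summarize_attachments(card):
    """Return one summary dict per item in the card's "attachments" array."""
    primary_type = card["metadata"]["asset_type"]
    summary = []
    for item in card.get("attachments", []):
        is_remote = bool(item.get("is_remote"))
        is_referenced = bool(item.get("is_referenced"))
        if is_remote and is_referenced:
            # The text states that at most one of the two flags is true.
            raise ValueError(f"attachment {item.get('id')} is both remote and referenced")
        if is_remote:
            location = "remote"        # reached via connection_id + connection_path
        elif is_referenced:
            location = "referenced"
        else:
            location = "unspecified"   # label invented for this sketch
        summary.append({
            "id": item.get("id"),
            "asset_type": item.get("asset_type"),
            # True when the item is linked to the primary attribute.
            "points_to_asset_resource": item.get("asset_type") == primary_type,
            "location": location,
        })
    return summary


card = {
    "metadata": {"asset_type": "data_asset"},
    "attachments": [{
        "id": "b8c7a390-e857-4c34-add8-___",
        "asset_type": "data_asset",
        "connection_id": "070e9be2-40a8-4e0e-___",
        "is_remote": True,
        "is_referenced": False,
    }],
}
print(summarize_attachments(card))
```

A card without an "attachments" array, such as the Connection card, yields an empty list.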
Getting an Asset - CSV File - Response Body - After Profiling:
Now, let's compare what GET {service_URL}/v2/assets/{asset_id}?catalog_id={catalog_id} returns for the same asset after the Profile API has been invoked on the Sample.csv file:
{
"metadata": {
"rov": {
"mode": 0,
"collaborator_ids": {}
},
"name": "Sample.csv",
"description": "Simple csv file for experiment for getting started document.",
"tags": [],
"asset_type": "data_asset",
"origin_country": "united states",
"rating": 0,
"total_ratings": 0,
"catalog_id": "c6f3cbd8-2b7f-42fb-aa60-___",
"created": 1541526321437,
"created_at": "2018-11-06T17:45:21Z",
"owner_id": "IBMid-___",
"size": 9238,
"version": 2,
"asset_state": "available",
"asset_attributes": [
"data_asset",
"data_profile"
],
"asset_id": "45f4ab8c-37d5-45a1-8adf-___",
"asset_category": "USER"
},
"entity": {
"data_asset": {
"mime_type": "text/csv",
"dataset": true,
"columns": [
{
"name": "Name",
"type": {
"type": "varchar",
"length": 1024,
"scale": 0,
"nullable": true,
"signed": false
}
},
{
"name": "Number",
"type": {
"type": "varchar",
"length": 1024,
"scale": 0,
"nullable": true,
"signed": false
}
}
]
},
"data_profile": {
"971e9c66-be4c-44b4-91f3-___": {
"metadata": {
"guid": "971e9c66-be4c-44b4-91f3-___",
"asset_id": "971e9c66-be4c-44b4-91f3-___",
"dataset_id": "45f4ab8c-37d5-45a1-8adf-___",
"catalog_id": "c6f3cbd8-2b7f-42fb-aa60-___",
"created_at": "2018-11-12T15:32:53.902Z",
"accessed_at": "2018-11-12T15:32:53.902Z",
"owner_id": "IBMid-___",
"last_updater_id": "IBMid-___"
},
"entity": {
"data_profile": {
"options": {
"disable_profiling": false,
"max_row_count": 5000,
"max_distribution_size": 100,
"max_numeric_stats_bins": 200,
"classification_options": {
"disabled": false,
"use_all_ibm_classes": true,
"ibm_class_codes": [],
"custom_class_codes": []
}
},
"execution": {
"status": "finished",
"is_supported": true,
"dataflow_id": "3f1ace02-4d40-451d-9bc7-___",
"dataflow_run_id": "f774f92f-5a61-49ca-8a68-___"
},
"columns": [],
"attachment_id": "8d614be0-6900-403b-ab50-___"
}
}
},
"attribute_classes": [
"NoClassDetected",
"Organization Name"
]
}
},
"attachments": [
{
"id": "b8c7a390-e857-4c34-add8-___",
"version": 2,
"asset_type": "data_asset",
"name": "remote",
"description": "remote",
"connection_id": "070e9be2-40a8-4e0e-___",
"connection_path": "catalogforgettingsta-datacatalog-r1s___/data_asset/Sample_SyjEQUy6m.csv",
"create_time": 1541526323713,
"size": 0,
"is_remote": true,
"is_managed": false,
"is_referenced": false,
"is_object_key_read_only": false,
"is_user_provided_path_key": true,
"transfer_complete": true,
"is_partitioned": false,
"complete_time_ticks": 1541526323713,
"user_data": {},
"test_doc": 0,
"usage": {
"access_count": 0,
"last_accessor_id": "IBMid-___",
"last_access_time": 1541526323713
}
},
{
"id": "8d614be0-6900-403b-ab50-___",
"version": 2,
"asset_type": "data_profile",
"name": "data_profile_971e9c66-be4c-44b4-91f3-___",
"object_key": "data_profile_971e9c66-be4c-44b4-91f3-___",
"create_time": 1542036813627,
"size": 9238,
"is_remote": false,
"is_managed": false,
"is_referenced": true,
"is_object_key_read_only": false,
"is_user_provided_path_key": true,
"transfer_complete": true,
"is_partitioned": false,
"complete_time_ticks": 1542036813627,
"user_data": {},
"test_doc": 0,
"handle": {
"bucket": "catalogforgettingsta-datacatalog-r1s___",
"location": "us-geo",
"key": "data_profile_971e9c66-be4c-44b4-91f3-___",
"upload_id": "done",
"max_part_num": 1
},
"usage": {
"access_count": 0,
"last_accessor_id": "iam-ServiceId-12345___",
"last_access_time": 1542036813627
}
}
]
}
Let's look at a few of the most important differences between the primary metadata document for the Sample.csv file before and after profiling:
- "metadata":
  - "asset_attributes": [ "data_asset", "data_profile" ]
    - Note the "data_profile" attribute name has been added
- "entity":
  - "data_asset":
    - "columns": the Profile API has added the "columns" field to the data_asset attribute
    - "dataset": the Profile API caused this to change from false to true because of the newly added "columns" field
  - "data_profile":
    - this "data_profile" attribute is entirely new, and was added by the Profile API.
    - the name of this secondary attribute matches the name of the secondary asset type "data_profile", which was (previously) created by the Profile API.
    - the contents of this "data_profile" attribute were entirely decided by the Profile API, not by the Assets API.
    - this attribute contains a lot of extended metadata about the "data_profile" run that produced a "data_profile" extended metadata document.
- "attachments":
  - a new item has been added to the "attachments" array
  - that new item contains the following metadata about an extended metadata document:
    - "id": "8d614be0-6900-403b-ab50-___"
    - "asset_type": "data_profile"
      - note that the value "data_profile" matches the name of the "data_profile" attribute that this attachment item belongs to, thereby linking the attachment item and the attribute.
    - "handle": contains various fields pointing to the actual attached extended metadata document which is located in some external repository. That extended metadata document will contain a great deal more metadata about the asset resource, that is, about the "Sample.csv" file.
The next section shows how to retrieve the Extended Metadata Document that's referred to by the new "data_profile" "attachments" item just described above.
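The before/after comparison above can also be computed rather than read off by eye. The following Python sketch assumes two cards already parsed into dictionaries; the function name profiling_changes is hypothetical, and the two small cards are trimmed copies of the responses shown earlier.

```python
def profiling_changes(before, after):
    """List what the "after" card has that the "before" card lacks."""
    before_attrs = set(before["metadata"].get("asset_attributes", []))
    before_ids = {item["id"] for item in before.get("attachments", [])}

    def dataset_flag(card):
        return card.get("entity", {}).get("data_asset", {}).get("dataset")

    return {
        "new_attributes": [
            name for name in after["metadata"].get("asset_attributes", [])
            if name not in before_attrs
        ],
        "new_attachment_ids": [
            item["id"] for item in after.get("attachments", [])
            if item["id"] not in before_ids
        ],
        # (value before profiling, value after profiling)
        "dataset": (dataset_flag(before), dataset_flag(after)),
    }


before = {
    "metadata": {"asset_attributes": ["data_asset"]},
    "entity": {"data_asset": {"mime_type": "text/csv", "dataset": False}},
    "attachments": [{"id": "b8c7a390-e857-4c34-add8-___", "asset_type": "data_asset"}],
}
after = {
    "metadata": {"asset_attributes": ["data_asset", "data_profile"]},
    "entity": {"data_asset": {"mime_type": "text/csv", "dataset": True, "columns": []}},
    "attachments": [
        {"id": "b8c7a390-e857-4c34-add8-___", "asset_type": "data_asset"},
        {"id": "8d614be0-6900-403b-ab50-___", "asset_type": "data_profile"},
    ],
}
print(profiling_changes(before, after))
```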
Get Attachment - Extended Metadata Document:
The following example builds on the GET Asset example from the previous section and shows how to retrieve an attachment that is an extended metadata document.
An attachment can be retrieved in 4 steps.
Step 1: Choose the asset_type of the attachment you want to retrieve.
The only choices you have for asset_type in a given primary metadata document are listed in that document's "metadata.asset_attributes" field. In the example above those values are:
- "data_asset"
- "data_profile"
The asset_type of the extended metadata document we want is "data_profile".
Step 2: Get the "id" of the "attachments" item whose "asset_type" field has the value you chose in Step 1.
In the primary metadata document, look for the only "attachments" item whose "asset_type" field has the value you chose in Step 1, namely "data_profile". In our example primary metadata document above, that "attachments" item has the "id" value "8d614be0-6900-403b-ab50-___".
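Steps 1 and 2 amount to a lookup in the card, which the following Python sketch illustrates. The helper name find_attachment_id is hypothetical. It assumes, as the text above implies, that exactly one "attachments" item carries the chosen asset_type, and it raises an error otherwise.

```python
def find_attachment_id(card, asset_type):
    """Return the "id" of the single attachment item with the given asset_type."""
    # Step 1: the chosen asset_type must be listed in metadata.asset_attributes.
    allowed = card["metadata"].get("asset_attributes", [])
    if asset_type not in allowed:
        raise ValueError(f"{asset_type!r} is not one of {allowed}")
    # Step 2: find the attachment item with that asset_type.
    matches = [
        item["id"] for item in card.get("attachments", [])
        if item.get("asset_type") == asset_type
    ]
    if len(matches) != 1:
        raise LookupError(f"expected one {asset_type!r} attachment, found {len(matches)}")
    return matches[0]


card = {
    "metadata": {"asset_attributes": ["data_asset", "data_profile"]},
    "attachments": [
        {"id": "b8c7a390-e857-4c34-add8-___", "asset_type": "data_asset"},
        {"id": "8d614be0-6900-403b-ab50-___", "asset_type": "data_profile"},
    ],
}
print(find_attachment_id(card, "data_profile"))
```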
Step 3: Invoke the Get Attachment API to get attachment metadata for the attached extended metadata document.
Get Asset Attachment - Request URL
GET /v2/assets/{asset_id}/attachments/{attachment_id}
The values for the above URL parameters are obtained as follows:
- {asset_id}: the same value that appears in the "metadata.asset_id" field of the above primary metadata document, namely "45f4ab8c-37d5-45a1-8adf-___"
- {attachment_id}: the value of "id" that was obtained in Step 2, namely "8d614be0-6900-403b-ab50-___"
Invoke the above GET Attachment API with the above values, which will return an attachment metadata document as shown in the following response body:
Get Asset Attachment - Response Body:
{
"attachment_id": "8d614be0-6900-403b-ab50-___",
"asset_type": "data_profile",
"is_partitioned": false,
"name": "data_profile_971e9c66-be4c-44b4-91f3-___",
"created_at": "2018-11-12T15:33:33Z",
"object_key": "data_profile_971e9c66-be4c-44b4-91f3-___",
"object_key_is_read_only": false,
"bucket": {
"bucket_name": "catalogforgettingsta-datacatalog-r1s___",
"bluemix_cos_connection": {
"viewer": {
"bucket_connection_id": "5b6bc03d-577d-4609-b3a4-___"
},
"editor": {
"bucket_connection_id": "070e9be2-40a8-4e0e-a468-___"
}
}
},
"url": "https://s3.us-south.objectstorage.softlayer.net/catalogforgettingsta-datacatalog-r1s___/data_profile_971e9c66-be4c-44b4-91f3-___?response-content-disposition=attachment%3B%20filename%3D%22data_profile_971e9c66-be4c-44b4-91f3-___%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20190423T162446Z&X-Amz-SignedHeaders=host&X-Amz-Expires=86400&X-Amz-Credential=d2d518b66ac64de___%2F2019___%2Fus-geo%2Fs3%2Faws4_request&X-Amz-Signature=ce7322d7291396c511a6df38635df4e85b7c78c173___",
"transfer_complete": true,
"size": 9238,
"user_data": {},
"creator_id": "iam-ServiceId-12345___",
"usage": {
"access_count": 1,
"last_accessor_id": "IBMid-___",
"last_access_time": 1556036686480
}
}
It's important to understand that the GET Attachment API only returns a metadata document that describes where, or how, an attached asset resource or extended metadata document can be accessed or retrieved.
The most important field in the above response is "url" which contains a signed URL that can be used to retrieve the actual extended metadata document. Note that the "url" points to a completely different server than the server that responds to "Assets API" calls! Extended metadata documents are not stored in the catalog.
Step 4: Use the "url" in the response from Step 3 to call the relevant server to get the extended metadata document.
The simplest way to use that "url" value is to paste it into the address bar of a browser, and let the browser retrieve the extended metadata document. Here's a peek at some of the contents of the large extended metadata document that can be retrieved using that "url" value. That large extended metadata document was created by the Profile API and contains a great deal of extended metadata about our small Sample.csv file:
{
"summary": {
"version": "1.9.3",
"row_count": 2,
"score": 1,
"score_stats": {
"n": 2,
"mean": 1.0,
"variance": 0.0,
"stddev": 0.0,
"min": 1.0,
"max": 1.0,
"sum": 2.0
},
...
},
"columns": [{
"name": "Name",
"value_analysis": {
"distinct_count": 2,
"null_count": 0,
"empty_count": 0,
"unique_count": 2,
"max_value_frequency": 1,
"min_string": "abc",
"max_string": "def",
"inferred_type": {
"type": {
"length": 3,
"precision": 0,
"scale": 0,
"type": "STRING"
}
},
...
}, {
"name": "Number",
"value_analysis": {
"distinct_count": 2,
"null_count": 0,
"empty_count": 0,
"unique_count": 2,
"max_value_frequency": 1,
"min_string": "123",
"max_string": "456",
"min_number": 123.0,
"max_number": 456.0,
"inferred_type": {
"type": {
"length": 3,
"precision": 3,
"scale": 0,
"type": "INT16"
}
},
...
]
}
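Steps 3 and 4 can also be scripted instead of using a browser. The Python sketch below is illustrative only: the function names are hypothetical, the optional catalog_id parameter is an assumption modelled on the GET Asset URL (the request URL shown in Step 3 does not include it), and sending no Authorization header to the signed URL is likewise an assumption based on the URL carrying its own signature.

```python
import json
import urllib.parse
import urllib.request


def build_get_attachment_request(service_url, asset_id, attachment_id,
                                 access_token, catalog_id=None):
    """Step 3: build GET /v2/assets/{asset_id}/attachments/{attachment_id}."""
    path = "/v2/assets/{}/attachments/{}".format(
        urllib.parse.quote(asset_id, safe=""),
        urllib.parse.quote(attachment_id, safe=""),
    )
    url = service_url.rstrip("/") + path
    if catalog_id is not None:
        # Assumption: mirrors the catalog_id query parameter of GET Asset.
        url += "?" + urllib.parse.urlencode({"catalog_id": catalog_id})
    return urllib.request.Request(
        url, method="GET", headers={"Authorization": f"Bearer {access_token}"}
    )


def download_extended_metadata(attachment_doc):
    """Step 4: follow the signed "url" and parse the JSON document it returns."""
    # Assumption: the signature embedded in the URL is sufficient, so no
    # Authorization header is added.
    with urllib.request.urlopen(attachment_doc["url"]) as response:
        return json.load(response)


# Placeholder values for illustration only; nothing is sent here.
req = build_get_attachment_request(
    "https://cpd.example.com", "asset-123", "attachment-789", "TOKEN"
)
print(req.full_url)
```

Because the signed URL in the example response has an X-Amz-Expires value, the download in Step 4 would need to happen before the URL expires.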
Get Attachment - Asset Resource:
The 4 steps given above to retrieve an extended metadata document can also be used to retrieve an asset resource like the Sample.csv file example.
The main difference is that in Step 1 you would choose the asset_type "data_asset" because that is the primary asset type of the primary metadata document, ie. the asset_type that identifies both the primary attribute and the primary attachment, ie, the asset resource.
Create Asset: book
Before you can create a primary metadata document (ie, card) the asset type that you want to use for that card must already exist. You can use one of the already available asset types, or you can use an asset type that you have created.
The Create Asset Type: book section shows how to create an asset type named book. In this section, that asset type will be used to create a primary metadata document for a book asset resource. That primary metadata document will have:
- a "metadata.asset_type" field with the value "book"
- a primary attribute called "book".
Use the following endpoint to create a primary metadata document for a book asset resource:
Create Asset: book - Request URL:
POST {service_URL}/v2/assets?catalog_id={catalog_id}
Create Asset: book - Request Body:
{
"metadata": {
"name": "Getting Started with Assets",
"description": "Describes how to create and use metadata for assets",
"tags": ["getting", "started", "documentation"],
"asset_type": "book",
"origin_country": "us",
"rov": {
"mode": 0
}
},
"entity": {
"book": {
"author": {
"first_name": "Tracy",
"last_name": "Smith"
},
"price": 29.95
}
}
}
The above request body specifies the preliminary contents for the primary metadata document about to be created. Most of the fields have been described previously in the Asset's Primary Metadata Document section. However, there are a few things to note in particular about the above request:
- "metadata": you supply the values of only some of the fields that will end up appearing inside the "metadata" field of the primary metadata document about to be created, including:
  - "asset_type": the value "book" matches the name of the asset type for this document
  - "name": the name to use for the asset being described by this document
  - "description": a description for the asset
Notice that you do not supply a "metadata.asset_attributes" field in the request body. If you include a "metadata.asset_attributes" field in your Create Asset request body then the request will be rejected because it tried to supply a reserved value. The Assets API reserves control of the contents of the "metadata.asset_attributes" field.
- "entity": you supply the entire contents of the "entity" field
  - "book":
    - this is the primary attribute of the primary metadata document
    - the name of this attribute matches the name of the corresponding primary asset type "book"
    - contains metadata describing a book (does not contain the actual book asset resource)
Notice the above "book" attribute doesn't contain a field called "title" - a field which might be expected in an attribute for a book. In this case, we've chosen to put the title of the book in the "metadata.name" field of the card. However, the creator of the "book" attribute is free to include whatever fields they want in that attribute, including a field called "title" if desired.
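The rules above for a Create Asset request body can be captured in a small builder. This Python sketch is only an illustration; the function name build_create_asset_body and its parameters are hypothetical. It places the primary attribute under a key equal to the asset type and never emits "metadata.asset_attributes", which the Assets API reserves.

```python
import json


def build_create_asset_body(asset_type, name, description, attribute,
                            tags=(), origin_country="us"):
    """Assemble a request body for POST {service_URL}/v2/assets?catalog_id=..."""
    return {
        "metadata": {
            "name": name,
            "description": description,
            "tags": list(tags),
            "asset_type": asset_type,
            "origin_country": origin_country,
            "rov": {"mode": 0},
            # "asset_attributes" is deliberately absent: it is a reserved field.
        },
        # The primary attribute's name must match metadata.asset_type.
        "entity": {asset_type: dict(attribute)},
    }


body = build_create_asset_body(
    "book",
    "Getting Started with Assets",
    "Describes how to create and use metadata for assets",
    {"author": {"first_name": "Tracy", "last_name": "Smith"}, "price": 29.95},
    tags=["getting", "started", "documentation"],
)
print(json.dumps(body, indent=2))
```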
Create Asset: book - Response Body:
{
"metadata": {
"rov": {
"mode": 0,
"collaborator_ids": {}
},
"usage": {
"last_updated_at": "2019-04-30T14:37:57Z",
"last_updater_id": "IBMid-___",
"last_update_time": 1556635077746,
"last_accessed_at": "2019-04-30T14:37:57Z",
"last_access_time": 1556635077746,
"last_accessor_id": "IBMid-___",
"access_count": 0
},
"name": "Getting Started with Assets",
"description": "Describes how to create and use metadata for assets",
"tags": [
"getting",
"started",
"documentation"
],
"asset_type": "book",
"origin_country": "us",
"rating": 0,
"total_ratings": 0,
"catalog_id": "c6f3cbd8-___",
"created": 1556635077746,
"created_at": "2019-04-30T14:37:57Z",
"owner_id": "IBMid-___",
"size": 0,
"version": 2,
"asset_state": "available",
"asset_attributes": [
"book"
],
"asset_id": "3da5389d-d4a4-43da-be1f-___",
"asset_category": "USER"
},
"entity": {
"book": {
"author": {
"first_name": "Tracy",
"last_name": "Smith"
},
"price": 29.95
}
},
"asset_id": "3da5389d-d4a4-43da-be1f-___"
}
Notice that the card returned in the Create Asset Response Body has many more fields than were present in the Request Body. The Create Asset API has added a lot of information to the "metadata" part of the primary metadata document:
- "asset_id": most importantly, the Create Asset API has given your primary metadata document an id
- "owner_id": the API has made the caller of the API the owner of the asset
- "created_at": the API has recorded the time at which the metadata document was created. In general, this is not the same as the time at which an attached asset resource was created (although in this case there is no attached asset resource).
- "total_ratings": contains the number of ratings this asset has received. 0 for now because the primary metadata document is brand new.
- "usage": usage statistics. Since this is a brand new card these statistics don't yet contain much interesting data.
- "asset_attributes": notice that the Create Asset API has added the name of the primary attribute to this array.
On the other hand, notice that the Create Asset API did not modify the contents of the "entity" field in any way. In particular, the Create Asset API did not modify the contents of the primary attribute "book".
Your catalog now contains a primary metadata document for a "book" asset resource.
Duplicate Asset
Duplicate Asset Overview
When a CAMS call tries to create an asset (e.g., create a new asset, or promote/publish/clone an asset), CAMS can optionally detect pre-existing duplicate assets and take appropriate action based on configurations and query parameters: for example, ignoring the duplicates and creating a new asset, failing the call and returning an error saying duplicates were found, or updating the existing duplicate.
Planned from CPD 4.7, when a CAMS call tries to update an asset, CAMS will also optionally detect whether the asset with the incoming change would have any duplicates if the change were persisted, and take appropriate action based on configurations and query parameters: for example, ignoring the duplicates and updating the asset, or failing the call and returning an error saying the change would result in duplicates.
This process is called duplicate asset processing. This section describes how the duplicate asset processing works in CAMS and how you can make it work in the ways you desire.
What is a duplicate
An asset is considered a duplicate if it fits any of the following scenarios:
- Original asset - the asset that the incoming asset was originally cloned/published from.
  For instance, if you clone/publish an asset A to a project/catalog, which results in asset B, and then try to publish/clone asset B back to the original catalog/project, asset A is seen as the original asset and considered a duplicate.
- Copies of the same asset - an asset that was cloned/published/promoted from the same asset as the incoming asset.
  For example, if you clone/publish/promote an asset A to a project/catalog/space, which results in asset B, and then try to clone/publish/promote asset A again to the same target project/catalog/space, asset B is seen as a copy of the same asset and considered a duplicate.
- Asset with the same values - an asset that has the same values as the incoming asset, based on the effective duplicate detection strategy of the asset type. See the Duplicate Detection Strategy section for more details about duplicate detection strategy.
  Let's say that the effective duplicate detection strategy of the asset type data_asset in a project is DUPLICATE_DETECTION_BY_NAME (i.e., duplicate detection is based on the metadata.name field). If you try to create an asset of type data_asset with the name KPIReport2021 in this project, and there is an existing asset A of type data_asset with the same name, then asset A is considered a duplicate.
What to do with a duplicate
Users can set the configuration duplicate_action of the asset containers and/or specify the query parameter duplicate_action while calling endpoints to control how the service handles duplicate assets. The valid values of duplicate_action for calls creating a new asset are:
- IGNORE - ignore the duplicates and create a new asset
- REJECT - fail the call and return an error response similar to the one below (no asset will be created):
{
"trace": "290c281c-4adc-4e40-aa49-aaf7cd2dbf6a",
"errors": [
{
"code": "already_exists",
"message": "ASTSV3040E: Duplicate assets exist. '[cc5f7412-5c96-4d66-9c14-40b3c944ad79, 244a3612-63a8-4140-9423-f40841be33ee]'"
}
]
}
- UPDATE - update the duplicate with the incoming changes. See Multiple duplicates for how to choose the duplicate for updating if more than one duplicate is found. See here on how the duplicate asset is updated.
- REPLACE - overwrite the duplicate with the input values. See Multiple duplicates for how to choose the duplicate for overwriting if more than one duplicate is found. See here on how the duplicate asset is overwritten.
The valid values of duplicate_action for calls making changes to an existing asset (including restoring a deleted asset) are:
- IGNORE - ignore the duplicates and update the asset
- REJECT - fail the call and return an error response similar to the one below. The asset will not be updated or restored.
{
"trace": "8ea27315-c958-435a-b415-e3632a664dbc",
"errors": [
{
"code": "already_exists",
"message": "ASTSV3128E: The asset will have duplicate assets with IDs '[a0a54f65-8a3d-4f22-95b7-5456d8e43a71]' after saving the change."
}
]
}
Or the below for restoring an asset. Note that the reason that the id of the asset that will have duplicates is also in the message is that restoring an asset may potentially restore multiple related assets and it is possible that some of these related assets will have duplicates.
{
"trace": "7b2668a4-c794-4720-83dd-f4f16e9d0a04",
"errors": [
{
"code": "already_exists",
"message": "ASTSV3128E: The asset '8c740814-5fc0-4294-a2be-f042f15098f1' will have duplicate assets with IDs '[a0a54f65-8a3d-4f22-95b7-5456d8e43a71]' after being restored."
}
]
}
- UPDATE and REPLACE are not allowed as values of the query parameter duplicate_action for calls updating assets. However, if the query parameter duplicate_action is not supplied and the configuration duplicate_action is set to one of these values at the asset container level, that value becomes the effective duplicate_action, in which case it is treated the same as REJECT.
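A client that receives one of the REJECT responses above may want the IDs of the duplicates. The following Python sketch extracts them from the bracketed list in the error message. It assumes the message layout seen in the three examples above, which may not be a stable contract; the function name duplicate_ids is hypothetical.

```python
import re

# Matches the quoted, bracketed ID list, e.g. '[id-1, id-2]'.
ID_LIST = re.compile(r"'\[([^\]]*)\]'")


def duplicate_ids(error_body):
    """Collect duplicate asset IDs from "already_exists" errors in a response body."""
    ids = []
    for error in error_body.get("errors", []):
        if error.get("code") != "already_exists":
            continue
        for group in ID_LIST.findall(error.get("message", "")):
            ids.extend(part.strip() for part in group.split(",") if part.strip())
    return ids


response = {
    "trace": "290c281c-4adc-4e40-aa49-aaf7cd2dbf6a",
    "errors": [{
        "code": "already_exists",
        "message": "ASTSV3040E: Duplicate assets exist. "
                   "'[cc5f7412-5c96-4d66-9c14-40b3c944ad79, "
                   "244a3612-63a8-4140-9423-f40841be33ee]'",
    }],
}
print(duplicate_ids(response))
```

In the restore variant of the message, the ID of the asset being restored is quoted without brackets, so this pattern returns only the duplicates.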
The configuration duplicate_action can be set at the asset container level during the creation of a container and can be modified later by using the endpoint PUT /v2/asset_containers/configurations. If the configuration duplicate_action is not specified in an asset container, it will be equivalent to IGNORE.
The following example shows how to supply the configuration duplicate_action (along with other duplicate asset processing related configurations) while creating a catalog:
{
"name": "my catalog",
...
"configurations": {
"duplicate_action": "REJECT",
...
}
}
The following example shows how to update the configuration by using the endpoint PUT /v2/asset_containers/configurations:
{
"duplicate_action": "REPLACE",
...
}
The configuration duplicate_action can be overridden by the query parameter duplicate_action on individual calls to control how CAMS handles duplicates for those particular calls. The endpoints that support the query parameter are listed below. Note that the allowed options may differ from endpoint to endpoint, because not every endpoint supports all available options.
POST /v2/assets
POST /v2/data_assets
POST /v2/assets/{asset_id}/publish
POST /v2/assets/{asset_id}/clone
POST /v2/assets/{asset_id}/promote
POST /v2/assets/{asset_id}/deepcopy
POST /v2/assets/bulk_create
POST /v2/assets/bulk_patch
PATCH /v2/assets/{asset_id}
POST /v2/assets/{asset_id}/attributes
PATCH /v2/assets/{asset_id}/attributes/{attribute_key}
DELETE /v2/assets/{asset_id}/attributes/{attribute_key}
POST /v2/assets/{asset_id}/attachments
DELETE /v2/assets/{asset_id}/attachments/{attachment_id}
POST /v2/trashed_assets/{asset_id}/restore
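The precedence rules described in this section (query parameter over container configuration, IGNORE as the default, and UPDATE/REPLACE handling for update calls) can be summarized in a short Python sketch. It models the rules as stated in the text and is not the service's implementation; the function name and the is_update_call flag are hypothetical.

```python
VALID_ACTIONS = {"IGNORE", "REJECT", "UPDATE", "REPLACE"}
CREATE_ONLY_ACTIONS = {"UPDATE", "REPLACE"}


def effective_duplicate_action(query_param, container_config, is_update_call):
    """Resolve the duplicate_action that applies to a single call."""
    if query_param is not None:
        if query_param not in VALID_ACTIONS:
            raise ValueError(f"unknown duplicate_action {query_param!r}")
        if is_update_call and query_param in CREATE_ONLY_ACTIONS:
            # Not allowed as a query parameter for calls updating assets.
            raise ValueError(f"{query_param} is not allowed for update calls")
        return query_param  # the query parameter overrides the configuration
    action = container_config.get("duplicate_action") or "IGNORE"
    if is_update_call and action in CREATE_ONLY_ACTIONS:
        # Configured UPDATE/REPLACE is treated the same as REJECT on updates.
        return "REJECT"
    return action


print(effective_duplicate_action(None, {"duplicate_action": "REPLACE"}, True))
```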
Duplicate Detection Strategy
The duplicate detection strategy defines which fields are used to determine whether assets of a particular asset type are duplicates. The available duplicate detection strategies are:
- DUPLICATE_DETECTION_BY_NAME - the metadata.name field
- DUPLICATE_DETECTION_BY_RESOURCE_KEY - the metadata.resource_key field
- DUPLICATE_DETECTION_BY_NAME_AND_RESOURCE_KEY - the metadata.name and the metadata.resource_key fields
- DUPLICATE_DETECTION_NOT_APPLICABLE - no duplicate will be determined by the strategy. This strategy also disables the duplicate detection for original asset and copies of the same asset. In other words, it disables duplicate detection for assets of the asset type completely.
The duplicate detection strategy can be set at several levels, as shown below (from the highest priority to the lowest priority). A setting with a higher priority takes precedence over a setting with a lower priority.
- Strategy of the asset type defined in the asset container configuration, e.g.,
{
"duplicate_strategies": [
{
"asset_type": "data_asset",
"strategy": "DUPLICATE_DETECTION_BY_NAME_AND_RESOURCE_KEY"
}
]
}
- Strategy specified in the asset type definition, e.g.,
{
"description": "Job Run",
"fields": [],
"identity": {
"strategy": "DUPLICATE_DETECTION_NOT_APPLICABLE"
}
}
- System default strategy of the asset type in the residing asset container, i.e.,
  - Project/Space:
    - connection: DUPLICATE_DETECTION_BY_NAME_AND_RESOURCE_KEY
  - Catalog:
    - connection: DUPLICATE_DETECTION_BY_NAME_AND_RESOURCE_KEY
    - data_asset: DUPLICATE_DETECTION_BY_NAME_AND_RESOURCE_KEY
- The default_duplicate_strategy specified in the asset container configuration (which applies to all asset types), i.e.,
{
"default_duplicate_strategy": "DUPLICATE_DETECTION_BY_NAME"
}
- System default strategy DUPLICATE_DETECTION_BY_NAME (which applies to all asset types)
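The precedence list above can be read as a simple first-match lookup. The following is a minimal sketch of that reading, not the CAMS implementation: the function name, the container_kind labels ("project", "space", "catalog"), and the way the two configuration objects are passed in are all assumptions made for illustration.

```python
BY_NAME = "DUPLICATE_DETECTION_BY_NAME"
BY_NAME_AND_KEY = "DUPLICATE_DETECTION_BY_NAME_AND_RESOURCE_KEY"

# System defaults per asset type and container kind, as listed above.
SYSTEM_TYPE_DEFAULTS = {
    "project": {"connection": BY_NAME_AND_KEY},
    "space": {"connection": BY_NAME_AND_KEY},
    "catalog": {"connection": BY_NAME_AND_KEY, "data_asset": BY_NAME_AND_KEY},
}


def effective_strategy(asset_type, container_kind, container_config, type_definition):
    """Walk the levels from highest to lowest priority and return the first hit."""
    # 1. Per-asset-type strategy in the asset container configuration.
    for entry in container_config.get("duplicate_strategies", []):
        if entry.get("asset_type") == asset_type:
            return entry["strategy"]
    # 2. Strategy in the asset type definition ("identity.strategy").
    identity = type_definition.get("identity") or {}
    if "strategy" in identity:
        return identity["strategy"]
    # 3. System default for this asset type in this kind of container.
    by_type = SYSTEM_TYPE_DEFAULTS.get(container_kind, {})
    if asset_type in by_type:
        return by_type[asset_type]
    # 4. Container-wide default_duplicate_strategy.
    if "default_duplicate_strategy" in container_config:
        return container_config["default_duplicate_strategy"]
    # 5. System default for all asset types.
    return BY_NAME


config = {"default_duplicate_strategy": "DUPLICATE_DETECTION_BY_RESOURCE_KEY"}
print(effective_strategy("connection", "project", config, {}))
print(effective_strategy("notebook", "project", config, {}))
```

In this sketch a connection in a project still resolves to the system default for connections, because that level is listed above the container-wide default.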
Multiple duplicates
It is possible that a call may find more than one duplicate for a given asset. It could be that the duplicates were created before CAMS started processing duplicate assets, or the duplicates were created because multiple calls were creating the same asset at the same time.
If the effective value for duplicate_action is REJECT, CAMS will fail the call and return the asset ids of all the duplicates in the error response. If the effective value for duplicate_action is UPDATE or REPLACE, CAMS will rank the duplicates in the following order and choose, as the target for updating or overwriting, the highest-ranked duplicate that the caller has permission to update. When all else is equal, the duplicates are ranked by created time (i.e., the metadata.created field) from the earliest to the most recent.
- Original asset
- Copies of the same asset
- Asset with the same values
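The ranking described above can be sketched as a sort key. This is an illustration under assumptions, not the CAMS algorithm: the record layout (kind, created, can_update), the kind labels, and the use of integer timestamps are hypothetical, and returning None when the caller cannot update any duplicate is a placeholder because this section does not describe that case.

```python
# Lower number = higher rank, following the order listed above.
RANK = {"original": 0, "copy": 1, "same_values": 2}


def choose_target(duplicates):
    """Pick the highest-ranked duplicate that the caller may update.

    Ties within a rank go to the earliest `created` value.
    """
    candidates = [d for d in duplicates if d["can_update"]]
    if not candidates:
        return None  # behaviour in this case is not specified here
    return min(candidates, key=lambda d: (RANK[d["kind"]], d["created"]))


duplicates = [
    {"asset_id": "a1", "kind": "same_values", "created": 100, "can_update": True},
    {"asset_id": "a2", "kind": "original", "created": 300, "can_update": False},
    {"asset_id": "a3", "kind": "copy", "created": 250, "can_update": True},
    {"asset_id": "a4", "kind": "copy", "created": 200, "can_update": True},
]
target = choose_target(duplicates)
print(target["asset_id"])
```

Here the original asset is skipped because the caller lacks permission to update it, and the older of the two copies is chosen.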
How the duplicate asset is updated or overwritten
If the effective value of duplicate_action is UPDATE or REPLACE, the following information of the chosen duplicate asset (the highest-ranked duplicate that the caller has permission to update) will be updated or overwritten with the values from the incoming asset, based on the rules below. No other aspects of the duplicate asset will be updated or overwritten, e.g., privacy settings, collaborators, owners, source asset, ratings, revisions, non-inline relationships, related assets, etc.
- metadata
  - metadata.name
  - metadata.description
  - metadata.resource_key
  - metadata.origin_country
  - metadata.tags
- attributes (e.g., entity.data_asset, entity.data_profile, entity.column_info, etc.)
- attachments (i.e., attachments[*])
If duplicate_action is UPDATE, the chosen duplicate asset will be updated as described below. The result will look the same as the duplicate asset but with the incoming changes.
- The listed metadata fields will be replaced with the values from the incoming asset if the value from the incoming asset is not null
- All attributes in the incoming asset will be copied over, and any existing attribute with the same name will be replaced. Some exceptions may apply, e.g.,
  - If the existing or incoming attribute connection represents a reference connection, the existing attribute connection will not be replaced.
  - Reference copy terms are preserved
- All attachments in the incoming asset will be copied over. Any existing attachments with the same asset_type as the attachments in the incoming asset will be removed.
If duplicate_action is REPLACE, the chosen duplicate asset will be updated as described below. The result will look the same as the incoming asset with some exceptions.
- All listed metadata fields will be replaced with the values from the incoming asset (even if the value from the incoming asset is null)
- All existing attributes will be removed and all attributes in the incoming asset will be copied over, with the exception of the attribute connection. If the existing or incoming attribute connection represents a reference connection, the existing attribute connection will remain unchanged.
- All existing attachments will be removed and all attachments in the incoming asset will be copied over
Let's say you have an existing asset that looks like below
{
"metadata": {
"name":"B",
"description":"C",
"tags":[
"confidential"
]
},
"entity": {
"data_asset": {
"mime_type": "binary"
},
"something": {}
},
"attachments":[
{
"asset_type": "data_asset",
"name": "attachment 1"
},
{
"asset_type": "something",
"name": "attachment 2"
}
]
}
and you try to add the following asset
{
"metadata": {
"name":"A"
},
"entity": {
"data_asset": {
"mime_type": "text/csv"
}
},
"attachments":[
{
"asset_type": "data_asset",
"name": "attachment 3"
}
]
}
If the effective duplicate_action is UPDATE, the existing asset will be modified to be the following
{
"metadata": {
"name":"A",
"description":"C",
"tags":[
"confidential"
]
},
"entity": {
"data_asset": {
"mime_type": "text/csv"
},
"something": {}
},
"attachments":[
{
"asset_type": "something",
"name": "attachment 2"
},
{
"asset_type": "data_asset",
"name": "attachment 3"
}
]
}
If the effective duplicate_action is REPLACE, the existing asset will be modified to be the following
{
"metadata": {
"name":"A"
},
"entity": {
"data_asset": {
"mime_type": "text/csv"
}
},
"attachments":[
{
"asset_type": "data_asset",
"name": "attachment 3"
}
]
}
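The two outcomes above can be reproduced with a small simulation of the UPDATE and REPLACE rules. This is a simplified sketch, not CAMS code: the function name is hypothetical, a null metadata value is modelled as an absent key, and the exceptions for reference connections and reference copy terms are deliberately left out.

```python
import copy

# Metadata fields that duplicate processing may overwrite, as listed above.
METADATA_FIELDS = ("name", "description", "resource_key", "origin_country", "tags")


def apply_duplicate_action(existing, incoming, action):
    """Return a copy of `existing` merged with `incoming` per UPDATE or REPLACE."""
    result = copy.deepcopy(existing)
    result.setdefault("metadata", {})
    incoming_meta = incoming.get("metadata", {})
    incoming_entity = copy.deepcopy(incoming.get("entity", {}))
    incoming_attachments = copy.deepcopy(incoming.get("attachments", []))

    if action == "UPDATE":
        # Only non-null incoming metadata values overwrite existing ones.
        for field in METADATA_FIELDS:
            if incoming_meta.get(field) is not None:
                result["metadata"][field] = copy.deepcopy(incoming_meta[field])
        # Incoming attributes replace same-named ones; others are kept.
        result.setdefault("entity", {}).update(incoming_entity)
        # Existing attachments sharing an asset_type with an incoming one go.
        incoming_types = {a["asset_type"] for a in incoming_attachments}
        kept = [a for a in result.get("attachments", [])
                if a["asset_type"] not in incoming_types]
        result["attachments"] = kept + incoming_attachments
    elif action == "REPLACE":
        # Listed metadata fields are overwritten even when incoming is null.
        for field in METADATA_FIELDS:
            value = incoming_meta.get(field)
            if value is None:
                result["metadata"].pop(field, None)
            else:
                result["metadata"][field] = copy.deepcopy(value)
        result["entity"] = incoming_entity
        result["attachments"] = incoming_attachments
    else:
        raise ValueError(f"unsupported action: {action}")
    return result


existing = {
    "metadata": {"name": "B", "description": "C", "tags": ["confidential"]},
    "entity": {"data_asset": {"mime_type": "binary"}, "something": {}},
    "attachments": [
        {"asset_type": "data_asset", "name": "attachment 1"},
        {"asset_type": "something", "name": "attachment 2"},
    ],
}
incoming = {
    "metadata": {"name": "A"},
    "entity": {"data_asset": {"mime_type": "text/csv"}},
    "attachments": [{"asset_type": "data_asset", "name": "attachment 3"}],
}
updated = apply_duplicate_action(existing, incoming, "UPDATE")
replaced = apply_duplicate_action(existing, incoming, "REPLACE")
```

With the sample assets from this section, updated keeps the description, tags, the "something" attribute and "attachment 2", while replaced ends up looking like the incoming asset.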
Backup revision
When the best duplicate is updated as a result of duplicate asset processing, a revision is created so that you can go back to, or review, what was changed. The revision will contain commit information similar to the following:
{
"committed_at": "2020-11-17T08:13:39.103Z",
"commit_message": "Backup prior to update the best duplicate",
"reason": "update_duplicate",
"duplicate_source": {
"operation": "clone",
"asset_id": "ca0007d9-051d-478f-b87f-82f38fc6997c",
"catalog_id": "c548b021-e026-49a6-aa60-35fe478afdb5"
}
}
In such a case, the asset in the response will contain the previous_revision field, which can be used to determine whether the call created a new asset or updated an existing asset.
{
"metadata": {
...,
"commit_info": {
"previous_revision": 1
}
},
"entity": {
...
},
"asset_id": "c1bf6686-836c-4f93-b173-c2ed52da8e76"
}
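A caller could use that field to tell the two outcomes apart, for example as in the sketch below. The helper name is hypothetical, and the check assumes the response has the shape shown above, with previous_revision under metadata.commit_info.

```python
def updated_existing_asset(response):
    """Return True if the response indicates that an existing duplicate was updated.

    Assumes previous_revision appears under metadata.commit_info only when
    duplicate processing updated an existing asset, as described above.
    """
    commit_info = response.get("metadata", {}).get("commit_info") or {}
    return "previous_revision" in commit_info


response = {
    "metadata": {"commit_info": {"previous_revision": 1}},
    "entity": {},
    "asset_id": "c1bf6686-836c-4f93-b173-c2ed52da8e76",
}
print(updated_existing_asset(response))
```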
Check duplicates before creating an asset
Duplicate asset processing kicks in automatically when a CAMS call tries to create an asset. However, in some cases, you may want to check for possible duplicates before creating an asset or before publishing, cloning, promoting, or deep-copying an asset. CAMS provides the endpoint POST /v2/assets/duplicates/search to help you do this. You can supply either an existing asset or an asset payload to check for duplicates. The endpoint will list all the potential duplicates and why they were considered duplicates.
Lineage and Activity event messages change
If the query parameter duplicate_action is UPDATE or REPLACE and duplicate assets are found, the calls will change from creating a new asset to updating an existing asset. As a result, the corresponding Lineage and Activity event messages will also change from creating an asset to updating an asset.
Known issues/limitations
Duplicate assets may be created due to race condition
When a call is made to create an asset, CAMS searches for potential duplicates and creates the asset only if no duplicate is found. If multiple calls are made at the exact same time to create the same asset, a duplicate can result. For example, two calls try to create the same asset at the same time; both calls search for potential duplicates and find none, so both calls conclude that there is no duplicate and create the asset. As a result, the same asset is created twice and each asset is a duplicate of the other.
There is currently no way to prevent such cases from happening. If this happens, the user has to choose one of these assets and delete the others. Internally, CAMS always favours the asset that has the earliest value in the metadata.created field and chooses it for future duplicate updating/overwriting operations.
Troubleshooting
Duplicate assets are not detected
Sometimes you may see assets that look like duplicates but are not detected as duplicates. You can call the GET /v2/assets/{asset_id} API to get the JSON representation of these assets and compare the fields that are used for identifying duplicates. If the fields are not the same, then the assets are not duplicates. Otherwise, the assets may fall into one of the situations where real duplicates are not detected. See the Known issues/limitations section for more details.
For instance, if the effective duplicate detection strategy is DUPLICATE_DETECTION_BY_NAME_AND_RESOURCE_KEY, you can compare the metadata.name and the metadata.resource_key fields of these assets and see if they are indeed the same; if the effective duplicate detection strategy is DUPLICATE_DETECTION_BY_NAME, you can compare the metadata.name field of these assets. For details about which fields are used for a strategy, see the Duplicate Detection Strategy section.
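The comparison described above can be scripted once you have the JSON of both assets. The sketch below is an assumption-laden helper, not part of the API: the function name is hypothetical, the inputs are the parsed bodies returned by GET /v2/assets/{asset_id}, and two assets that both lack a field are treated as matching on that field, which may not reflect how CAMS handles missing values.

```python
# Fields compared by each strategy, per the Duplicate Detection Strategy section.
STRATEGY_FIELDS = {
    "DUPLICATE_DETECTION_BY_NAME": ("name",),
    "DUPLICATE_DETECTION_BY_RESOURCE_KEY": ("resource_key",),
    "DUPLICATE_DETECTION_BY_NAME_AND_RESOURCE_KEY": ("name", "resource_key"),
    "DUPLICATE_DETECTION_NOT_APPLICABLE": (),
}


def identity_fields_match(asset_a, asset_b, strategy):
    """Return True if both assets agree on every field the strategy uses."""
    fields = STRATEGY_FIELDS[strategy]
    if not fields:
        return False  # NOT_APPLICABLE never yields duplicates
    meta_a = asset_a.get("metadata", {})
    meta_b = asset_b.get("metadata", {})
    return all(meta_a.get(field) == meta_b.get(field) for field in fields)


first = {"metadata": {"name": "sales.csv", "resource_key": "/data/sales.csv"}}
second = {"metadata": {"name": "sales.csv", "resource_key": "/archive/sales.csv"}}
print(identity_fields_match(first, second, "DUPLICATE_DETECTION_BY_NAME"))
print(identity_fields_match(first, second, "DUPLICATE_DETECTION_BY_NAME_AND_RESOURCE_KEY"))
```

In this sample the two assets match by name only, so they would count as duplicates under one strategy and not under the other.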
Asset Types
Asset Types serve multiple purposes in the Assets API. Asset types fall into two categories:
- Primary asset type:
  - describes the primary type of an asset
  - every primary metadata document (i.e., card) will have exactly one primary asset type, whose name will be stored in the card's "metadata.asset_type" field
  - every card will have exactly one primary attribute whose name matches the name of the primary asset type
  - a very common example of a primary asset type is the "data_asset" type, examples of which are shown throughout this documentation
- Secondary / Extended asset type:
  - a secondary / extended asset type describes an inter-related group of additional metadata for an asset resource
  - a primary metadata document can have 0, 1, or many secondary / extended asset types
  - information for a secondary / extended asset type is stored in a secondary / extended attribute in a primary metadata document
  - a very common example of a secondary / extended asset type is "data_profile". See Get Asset - CSV File - Response Body - After Profiling for an example "data_profile" attribute.
The names of various asset types are used in the following ways, all at once, within a single primary metadata document:
- describe the type of an asset resource, via the "metadata.asset_type" field
- describe the type of an object that contains extended information for an asset resource. For example, the type of an extended metadata document via an "attachments[_].asset_type" field.
- assign names and types to attributes in the "entity" field of a primary metadata document
- implicitly tie various related parts of a primary metadata document to each other. For example, see the green rectangles and arrows in the Parts of a Primary Metadata Document Figure.
The content, or definition, of an asset type serves the following purposes:
- tell the catalog what fields of an attribute should be indexed for searching
- specify search paths and cross attribute searching
- specify additional features like relationships and external asset previews (both of which are beyond the scope of this document)
An asset type must exist in the catalog before it can be used for any of the above purposes.
As of this writing there are several asset types available, including the following:
- data_asset
- folder_asset
- policy_transform
- asset_terms
- column_info
- connection
- ai_training_definition
- data_flow
- activity
- notebook
- machine-learning-stream
- dashboard
- data_profile_nlu
You are free to use any of the above asset types. You do not have to, nor are you allowed to, create or over-write any of the above asset types.
Use the Create Asset Type API to create your own asset type. See Asset Type Fields for an overview of the specification of an asset type. See Create Asset Type: book for an example of creating an asset type.
Asset Type Fields
Here is a description of each of the fields in the definition of an asset type. You supply values for these fields when creating an asset type. You will see those same values returned when you get a list of asset types or get a specific asset type.
- "name":
  - the name and identifier for the asset type
  - should contain only lowercase letters
  - will be used in various places in primary metadata documents (see the list of uses in the Asset Types section above)
  - can be used in catalog searches of attribute contents
- "description": a description for this asset type
- "fields":
  - an array that contains information for the fields in the corresponding attribute that should be indexed for subsequent searches.
  - does not (necessarily) describe all the fields in attributes of this asset type.
  - there must be at least one item in this array. In other words, there must be at least one index for an asset type.
  - see the following Fields Table for a description of the contents of an item in the "fields" array
  - see the "fields" and "properties" Note below
- "global_search_searchable": an array of field key values denoting fields that should be searchable in Global Search. See the Global Search searchable custom attributes section for a special note on usage.
- "properties":
  - an object that contains "non-index" information for the fields in the corresponding attribute. This information is typically used by UIs that display/edit assets.
  - does not (necessarily) describe all the fields in attributes of this asset type.
  - see the following Properties Table for a description of the contents of an item in the "properties" object
  - see the "fields" and "properties" Note below
- "external_asset_preview": beyond the scope of this document
- "relationships": beyond the scope of this document
Note: "fields" and "properties" can, optionally, both be used to describe the exact same field in an attribute. Whether you use "fields" and/or "properties" depends on what you want to specify for a field. For example, if you're creating an asset type named "person" and a person has a field called "birthdate" (resulting in "entity.person.birthdate" being present in the primary metadata document) then:
- if you want birthdate to be indexed (for searching) then you would include an entry in the "fields" array for birthdate
- if you want a UI to understand/display the birthdate properly then you would include an entry in the "properties" object for that same birthdate field
See the request body in Create Asset Type: book, which shows both an example "fields" array and an example "properties" object.
Fields Table:
Key | Description | Example | Required |
---|---|---|---|
key | the name of both the field that will appear in an attribute for this asset type, and the name of the corresponding index for that attribute field | data_asset.mime_type | Yes |
type | the data type of the field being indexed | boolean, or number, or string | Yes |
facets | beyond the scope of this document | true or false | No. Defaults to false. |
search_path | a json path that locates a field in the attribute | See Search Path Examples below. | Yes |
is_searchable_across_types | specifies whether this field can be used in a query without specifying the asset type | true or false | No. Defaults to false. |
Properties Table:
Name | Type | Description |
---|---|---|
type | String | Specifies the data type for the property. This value is required. Possible types are: string, number |
description | String | A displayable string to describe the property. |
is_array | boolean | true if the property value is multi-valued (json array). |
required | boolean | true if the property requires a value to be set. |
hidden | boolean | true if the application UI should not display the property or value. |
readonly | boolean | true if the property should not be changed once set. |
default_value | matches the "type" | A value that should be set if no value is provided when the asset attribute is created. |
placeholder | string | A string an application UI can use as a prompt before a value is entered. |
values | array, elements matching "type" | An array of allowed values for the property. Used to describe a limited enumeration or "choice list". |
minimum | integer/number | For an integer or number property, the minimum allowed value. |
maximum | integer/number | For an integer or number property, the maximum allowed value. If both minimum and maximum are specified, minimum must be less than or equal to maximum. |
min_length | integer | For a string property, the minimum allowed length. If specified, must be greater than or equal to zero. |
max_length | integer | For a string property, the maximum allowed length. If specified, must be greater than or equal to zero. If both min_length and max_length are specified, min_length must be less than or equal to max_length. |
properties | object | For a property of type 'object', the recursive definition of the properties, described as in this table. This allows describing nested object-valued properties. |
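To make the Properties Table more concrete, here is a minimal sketch of how a UI might check a value against a property definition. It is an illustration only: the function names are hypothetical, it covers just the string and number types with the constraints listed in the table, and it says nothing about what validation, if any, the catalog itself performs.

```python
def _check_item(spec, item):
    """Check one scalar value against the constraints in a property spec."""
    problems = []
    kind = spec["type"]
    if kind == "string":
        if not isinstance(item, str):
            return [f"{item!r} is not a string"]
        if "min_length" in spec and len(item) < spec["min_length"]:
            problems.append(f"{item!r} is shorter than min_length")
        if "max_length" in spec and len(item) > spec["max_length"]:
            problems.append(f"{item!r} is longer than max_length")
    elif kind == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(item, bool) or not isinstance(item, (int, float)):
            return [f"{item!r} is not a number"]
        if "minimum" in spec and item < spec["minimum"]:
            problems.append(f"{item!r} is below minimum")
        if "maximum" in spec and item > spec["maximum"]:
            problems.append(f"{item!r} is above maximum")
    if "values" in spec and item not in spec["values"]:
        problems.append(f"{item!r} is not one of the allowed values")
    return problems


def validate_property(spec, value):
    """Return a list of problems; an empty list means the value looks valid."""
    if value is None:
        return ["a value is required"] if spec.get("required") else []
    if spec.get("is_array"):
        if not isinstance(value, list):
            return ["expected an array of values"]
        items = value
    else:
        items = [value]
    problems = []
    for item in items:
        problems.extend(_check_item(spec, item))
    return problems


price_spec = {"type": "number", "description": "Suggested retail price", "minimum": 0}
print(validate_property(price_spec, 29.95))
print(validate_property(price_spec, -1))
```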
Search Path Examples
- See the request body in Create Asset Type: book for an example of where a search path is used in the definition of an asset type.
  - Note: when you specify a search path in the definition of an asset type's "field", you only specify the path within the correspondingly named attribute. You needn't specify the attribute name. For example, if you have an attribute called "book" that has a field called "author.last_name" within it, you only need to specify "author.last_name" as the search path - not "book.author.last_name".
- See Search Asset Type: attribute - book for an example of where a search path is used in the body of a search.
  - Note: when you specify a search path in the body of a search you must specify the name of the attribute being searched. For example, if you have an attribute called "book" that has a field called "author.last_name" within it, you would include the name of the attribute in the search path: "book.author.last_name".
- "price": a simple path contains just the name of the field to be searched. In this case the attribute being searched should have a simple field called "price".
- "tags[]": traverse a json array called "tags". Because tags[] is not followed by any further names, its elements must be of a basic type (e.g. string, boolean, or number), and so they will be indexed directly.
- "asset_terms[].name": this search path indicates a path starting with a json object named "term_assignments" at the top, traversing through a json array named asset_terms (you use the [] at the end of the field name to indicate it's an array), landing on another json object that has a field called "name". The "name" field will be indexed.
- "asset_terms[0].name": same as above but only the first element in the "asset_terms" array will be traversed.
- "columns.*.tags[]": traverse an object called "columns" followed by any column name (the '*' indicates a wildcard), followed by a json array called "tags". Because tags[] is not followed by any further names, its elements must be of a basic type (e.g. string, boolean, or number), and so they will be indexed directly.
- "column_tags.*[]": the json object "column_tags" contains a series of arrays indicated by *[]. The name of the array object doesn't matter - we want to index it.
Global Search searchable custom attributes
Fields that are identified as Global Search searchable, by having their key included in the global_search_searchable array, will be synchronized as custom attributes to the Global Search microservice and will become searchable via Global Search. Note that a field will only be synchronized to Global Search if its field definition contains a valid search_path. Values found at that search_path will be synchronized to Global Search; otherwise the field will be ignored. Any values provided in the global_search_searchable array that do not correspond to any existing fields of the type will be ignored.
data_asset Type
"data_asset" is by far the most commonly used of the already available asset types. It can be seen in:
- the Parts of a Primary Metadata Document Figure
- many of the examples in the Assets API Examples and Asset Types API Examples sections
- the default asset type used when you drag an asset resource file onto the Create Asset page.
The reason "data_asset" is so popular is that it is a generic asset type that allows you to declare a specific type for a given asset resource without explicitly creating an asset type named after that specific type. For example, say you want to create a primary metadata document for a csv file. You could first create a specific asset type named, say, "csv_file", and then create a primary metadata document (for that csv file) and specify "csv_file" as the value for "metadata.asset_type". However, you can avoid creating a specific "csv_file" asset type by instead using the generic "data_asset" asset type and then using the "mime_type" field of the "data_asset" attribute to declare that the specific type of your asset resource is a csv file. To do so, the primary metadata document for the csv file would have:
- a "metadata.asset_type" value of the generic type "data_asset"
- an "entity.data_asset.mime_type" value of the specific type "text/csv".
The fields "asset_type" and "mime_type" both describe the "type" of the asset resource. However:
- the type specified by the "metadata.asset_type" field (i.e., "data_asset") is generic
- the type specified by the "entity.data_asset.mime_type" field (i.e., "text/csv") is specific
It is the "mime_type" field of the data_asset type that allows you to declare a specific type for an asset without creating that specific type(!).
So, in its most basic use, the "data_asset" asset type is a very "lite" asset type. It's used to avoid creating many other "heavier" asset types. However, if you need to create more complex attributes with indexes for specific fields in your attribute then you will have to create your own asset type (see Create Asset Type: book for an example).
The other two fields of the type "data_asset" are "dataset" and "columns".
- a "dataset" value of false means that the "columns" field is absent in a "data_asset" attribute
- a "dataset" value of true means that the "columns" field is present in a "data_asset" attribute
The "columns" field of a "data_asset" attribute is optionally used to specify metadata for columns of assets that have columns, like csv files, spreadsheets, database tables, etc.
The full definition of the "data_asset" type is shown in Get Asset Type: data_asset - Response Body.
See Get Asset - CSV File - Response Body - Before Profiling and Get Asset - CSV File - Response Body - After Profiling for examples where a "data_asset" is used for a csv asset resource.
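As a compact illustration of the points above, the sketch below assembles the parts of a primary metadata document that this section discusses for a csv file. The helper name is hypothetical and the result is deliberately partial: a real card contains further fields, and the column objects are assumed to carry a "name" key only because the "data_asset" type indexes the path "columns[].name".

```python
def make_data_asset_card(name, mime_type, columns=None):
    """Build a partial card that uses the generic "data_asset" type.

    `dataset` is set to True only when column metadata is supplied,
    mirroring the relationship between "dataset" and "columns" above.
    """
    data_asset = {"mime_type": mime_type, "dataset": columns is not None}
    if columns is not None:
        data_asset["columns"] = columns
    return {
        "metadata": {"name": name, "asset_type": "data_asset"},
        "entity": {"data_asset": data_asset},
    }


before = make_data_asset_card("sales.csv", "text/csv")
after = make_data_asset_card("sales.csv", "text/csv", columns=[{"name": "region"}])
print(before["entity"]["data_asset"])
print(after["entity"]["data_asset"])
```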
Get Asset Types
You can get a list of the asset types in a catalog using the following Asset Types API:
Get Asset Types - Request URL:
GET {service_URL}/v2/asset_types?catalog_id={catalog_id}
Get Asset Types - Response Body:
{
"resources": [
{
"description": "Data Asset Type",
"fields": [
{
"key": "dataset",
"type": "boolean",
"facet": true,
"is_array": false,
"is_searchable_across_types": false
},
{
"key": "mime_type",
"type": "string",
"facet": true,
"is_array": false,
"is_searchable_across_types": false
},
{
"key": "columns",
"type": "string",
"facet": true,
"is_array": true,
"search_path": "columns[].name",
"is_searchable_across_types": true
}
],
"global_search_searchable": [
"mime_type"
],
"external_asset_preview": {},
"relationships": [],
"name": "data_asset",
"version": 3
},
{
"description": "An asset type you can use to describe the columns of a data asset. Normally attached as a property to an existing data asset.",
"fields": [
{
"key": "column_info_term_display_name",
"type": "string",
"facet": true,
"is_array": false,
"search_path": "*.column_terms[].term_display_name",
"is_searchable_across_types": true
},
{
"key": "column_info_term_id",
"type": "string",
"facet": true,
"is_array": false,
"search_path": "*.column_terms[].term_id",
"is_searchable_across_types": false
},
{
"key": "column_info_tag",
"type": "string",
"facet": true,
"is_array": false,
"search_path": "*.column_tags[]",
"is_searchable_across_types": true
},
{
"key": "column_info_description",
"type": "string",
"facet": false,
"is_array": false,
"search_path": "*.column_description",
"is_searchable_across_types": true
},
{
"key": "column_info_omrs_guid",
"type": "string",
"facet": true,
"is_array": false,
"search_path": "*.omrs_guid",
"is_searchable_across_types": true
}
],
"external_asset_preview": {},
"relationships": [],
"name": "column_info",
"version": 4
},
{
"description": "An asset type that you can use to assign terms from a business glossary to any asset. Attach items of this type as attributes to other assets.",
"fields": [
{
"key": "asset_term_display_name",
"type": "string",
"facet": true,
"is_array": false,
"search_path": "list[].term_display_name",
"is_searchable_across_types": true
},
{
"key": "asset_term_id",
"type": "string",
"facet": true,
"is_array": false,
"search_path": "list[].term_id",
"is_searchable_across_types": false
}
],
"external_asset_preview": {},
"relationships": [],
"name": "asset_terms",
"version": 1
},
...
]
}
See Asset Type Fields for descriptions of the fields in each of the above asset types.
In a scenario in which the user has not yet created any of their own asset types, the result will contain only the pre-existing, global, asset types. For brevity, the sample result shown above includes only a subset of those asset types. Try the GET Asset Types API on your catalog to see the complete set of pre-existing, global, asset types.
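A request for the list of asset types could be prepared in Python roughly as follows. This is a sketch under assumptions: the function names, host, catalog id, and token value are placeholders, and the Authorization header follows the Bearer format described in the introduction. The request object is only constructed here; sending it would require urllib.request.urlopen and a reachable cluster.

```python
import urllib.request
from urllib.parse import urlencode


def asset_types_request(service_url, catalog_id, token):
    """Prepare (but do not send) a GET /v2/asset_types request."""
    url = f"{service_url}/v2/asset_types?{urlencode({'catalog_id': catalog_id})}"
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


def asset_type_names(response_body):
    """Extract the asset type names from a parsed response body."""
    return [asset_type["name"] for asset_type in response_body.get("resources", [])]


request = asset_types_request("https://cpd.example.com", "my-catalog-id", "my-token")
print(request.get_method(), request.full_url)

sample_body = {"resources": [{"name": "data_asset"}, {"name": "column_info"}]}
print(asset_type_names(sample_body))
```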
Get Asset Type: data_asset
You can get an individual asset type in a catalog using the following Asset Types API:
Get Asset Type: data_asset - Request URL:
GET {service_URL}/v2/asset_types/{type_name}?catalog_id={catalog_id}
Supplying "data_asset" as the value for the {type_name} parameter in the above URL will produce a response like the following:
Get Asset Type: data_asset - Response Body:
{
"description": "Data Asset Type",
"fields": [
{
"key": "mime_type",
"type": "string",
"facet": true,
"is_array": false,
"is_searchable_across_types": false
},
{
"key": "dataset",
"type": "boolean",
"facet": true,
"is_array": false,
"is_searchable_across_types": false
},
{
"key": "columns",
"type": "string",
"facet": true,
"is_array": true,
"search_path": "columns[].name",
"is_searchable_across_types": true
}
],
"global_search_searchable": [
"mime_type"
],
"external_asset_preview": {},
"relationships": [],
"name": "data_asset",
"version": 3
}
See Asset Type Fields for descriptions of the fields in the above asset type definition.
Since an asset type called "data_asset" exists, you can create a primary metadata document (i.e., card) with a "metadata.asset_type" value of "data_asset". That card must then also have a primary attribute called "data_asset".
The most interesting item in the "fields" array in the above "data_asset" asset type definition is the item with "key" value "mime_type". That item means that a primary attribute named "data_asset" will have a field called "mime_type". The value of that "mime_type" attribute field will declare the specific type of the asset resource represented by the primary metadata document. For example, see the field "entity.data_asset.mime_type" in Get Asset - CSV File - Response Body - Before Profiling where the "mime_type" value is "text/csv".
Notice that the "data_asset" attribute in Get Asset - CSV File - Response Body - Before Profiling contains only two fields: "mime_type" and "dataset". The "columns" field specified in the definition of the "data_asset" asset type is not present in the "data_asset" attribute.
Now compare all the items in the "fields" array in the above "data_asset" asset type definition with the "entity.data_asset" attribute fields as shown, for example, in Get Asset - CSV File - Response Body - After Profiling. Notice that now all the fields described in the "fields" array of the "data_asset" type are present as fields in the "entity.data_asset" attribute. In particular, profiling has added the "columns" field to the "data_asset" attribute.
The Before Profiling and After Profiling examples illustrate that not all the fields defined in an asset type need be present in a corresponding attribute.
Lastly, note that the asset type definition includes a global_search_searchable list of field keys, which contains the value mime_type. That indicates that the mime_type value of every instance of this asset type will be searchable via the Global Search microservice.
Create Asset Type: book
Say you have a book asset resource and you want to create a primary metadata document to describe that book. You will first need to create an asset type called "book" (as shown below) so you can then:
- use the name of that asset type as the value for the "metadata.asset_type" field in the primary metadata document
- create a primary attribute named "book" that will contain data about your book.
Say you want that primary attribute to look like the following:
"book": {
"author": {
"first_name": "Tracy",
"last_name": "Smith"
},
"price": 29.95
}
The above "book" attribute has:
- one complex field called "author" (complex fields are allowed in attributes)
- one simple field called "price".
For this example, assume you'll want to be able to search inside the "author.last_name" field of "book" attributes.
In addition, let's assume that you would like to use the value of the "author.last_name" field to search for "books" via the Global Search microservice.
To create an asset type named "book" that will allow you to do all of the above, use a request like the following:
Create Asset Type: book - Request URL:
POST {service_URL}/v2/asset_types?catalog_id={catalog_id}
Create Asset Type: book - Request Body:
{
"name": "book",
"description": "Book asset type",
"fields": [
{
"key": "author.last_name",
"type": "string",
"facet": false,
"is_array": false,
"search_path": "author.last_name",
"is_searchable_across_types": true
}
],
"global_search_searchable": [
"author.last_name"
],
"properties": {
"price" : {
"type": "number",
"description": "Suggested retail price"
}
}
}
The purpose of most of the fields used in the above request was described in the Asset Type Fields section. Here are some things to note specifically in the above request:
- "name": uses only lowercase letters, i.e., "book"
- "fields": even though our goal attribute has multiple fields in it, there is only one item in the asset type's "fields" array. That is because the "fields" array should only contain items for the fields of an attribute that we want the catalog to create an index for. In this case, we only want an index for the "author.last_name" field of "book" attributes.
  - "key": the name of the attribute field that we want indexed, and the name for that index. In this case, "author.last_name".
  - "type": the type of the "author.last_name" field is "string"
  - "facet": an explanation of this field is beyond the scope of this document
  - "is_array": false because "author.last_name" is not an array
  - "search_path": this is the path inside the attribute to the value that we want indexed
  - "is_searchable_across_types": an explanation of this field is beyond the scope of this document
- "global_search_searchable": since we would like to be able to search for the author.last_name value using Global Search, we include the corresponding field key value.
Create Asset Type: book - Response Body:
{
"description": "Book asset type",
"fields": [
{
"key": "author.last_name",
"type": "string",
"facet": false,
"is_array": false,
"search_path": "author.last_name",
"is_searchable_across_types": true
}
],
"global_search_searchable": [
"author.last_name"
],
"relationships": [],
"name": "book",
"version": 1
}
The response to the POST /v2/asset_types API echoes the input, with two additional fields:
- `relationships`: an explanation of the contents of this field is beyond the scope of this document
- `version`: the version of the newly created asset type
You now have an asset type called "book" that specifies one indexed, searchable field called "author.last_name". See Create Asset: book for an example of the ways in which that "book" asset type can be used when creating a primary metadata document.
Search Asset Type: attribute - book
The Search Asset Type API can be used to search inside a catalog for all the primary metadata documents that satisfy both of the following conditions:
- have a "metadata.asset_type" value that matches the asset type name specified in the {type_name} URL parameter
- have an attribute whose fields' values match those specified in the request body.
Recall that one of the primary reasons for creating an asset type is to specify fields in attributes (named after that asset type) that will be indexed for searching. The Create Asset Type: book section showed how to create an asset type named "book". The Create Asset: book section showed how to create a primary metadata document whose "metadata.asset_type" value and primary attribute name are both "book". So, if you use the value "book" for the {type_name} parameter in the URL below, and if you supply the following request body, then you'll get back matching metadata for books.
Search Asset Type: attribute - book - Request URL
POST {service_URL}/v2/asset_types/{type_name}/search?catalog_id={catalog_id}
Search Asset Type: attribute - book - Request Body:
{
"query":"book.author.last_name:Smith"
}
Notice how the query specifies both the attribute (book) to be searched and the search path (author.last_name) within that attribute. The value to match is specified after the colon (:). In this case, the value is Smith.
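The query format described above can be sketched in Python as follows. The helper names, the example host https://cpd.example.com, and the catalog ID are placeholders chosen for illustration; only the URL path and the attribute.search_path:value query shape come from this section.

```python
from urllib.parse import quote, urlencode


def attribute_query(attribute, search_path, value):
    """Build '<attribute>.<search_path>:<value>' as used in the request body."""
    return f"{attribute}.{search_path}:{value}"


def search_url(service_url, type_name, catalog_id):
    """Build the Search Asset Type URL; service_url is a placeholder host."""
    path = f"/v2/asset_types/{quote(type_name, safe='')}/search"
    return f"{service_url}{path}?{urlencode({'catalog_id': catalog_id})}"


body = {"query": attribute_query("book", "author.last_name", "Smith")}
url = search_url("https://cpd.example.com", "book", "my-catalog-id")
print(url)
print(body)
```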
The following is the result of the above search:
Search Asset Type: attribute - book - Response Body:
{
"total_rows": 1,
"results": [
{
"metadata": {
"rov": {
"mode": 0,
"collaborator_ids": {}
},
"usage": {
"last_updated_at": "2019-05-01T18:58:51Z",
"last_updater_id": "IBMid-___",
"last_update_time": 1556737131140,
"last_accessed_at": "2019-05-01T18:58:51Z",
"last_access_time": 1556737131140,
"last_accessor_id": "IBMid-___",
"access_count": 0
},
"name": "Getting Started with Assets",
"description": "Describes how to create and use metadata for assets",
"tags": [
"getting",
"started",
"documentation"
],
"asset_type": "book",
"origin_country": "us",
"rating": 0,
"total_ratings": 0,
"catalog_id": "c6f3cbd8-___",
"created": 1556635077746,
"created_at": "2019-04-30T14:37:57Z",
"owner_id": "IBMid-___",
"size": 0,
"version": 0,
"asset_state": "available",
"asset_attributes": [
"book"
],
"asset_id": "3da5389d-d4a4-43da-be1f-___",
"asset_category": "USER"
},
"href": "https://api.dataplatform.cloud.ibm.com/v2/assets/3da5389d-d4a4-43da-be1f-___?catalog_id=c6f3cbd8-___"
}
]
}
In this case, there is only one primary metadata document returned in the "results" array (namely, the primary metadata document that was created in the Create Asset: book section). In general, there can be many matching documents in the "results" array.
Notice that the results of an Asset Type Search, as shown above, only contain the "metadata" section of a primary metadata document. In particular, the "entity" section that contains the attributes is not returned. That is done to reduce the size of the response because, in general, the "entity" section of a primary metadata document can be much larger than the "metadata" section. Use the value of the "metadata.asset_id" in one of the items in "results" to retrieve either:
- the entire primary metadata document (using the GET Asset API), or
- just the attributes of the primary metadata document (using the GET Attributes API).
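A minimal sketch of pulling those identifiers out of a search response is shown below. The function name asset_refs and the sample values are hypothetical; the response shape ("results", "metadata.asset_id", "href") follows the example response above.

```python
def asset_refs(search_response):
    """Return (asset_id, href) pairs for each item in 'results'."""
    refs = []
    for item in search_response.get("results", []):
        metadata = item.get("metadata", {})
        refs.append((metadata.get("asset_id"), item.get("href")))
    return refs


# Trimmed-down response with made-up identifiers, shaped like the example above.
sample = {
    "total_rows": 1,
    "results": [
        {
            "metadata": {"asset_type": "book", "asset_id": "asset-123"},
            "href": "https://cpd.example.com/v2/assets/asset-123?catalog_id=cat-1",
        }
    ],
}
refs = asset_refs(sample)
print(refs)
```

Each asset_id (or href) can then be used in a follow-up call to fetch the full document or only its attributes.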
Notes:
- Searching is not limited to just primary attributes (like book above). Searches may also be performed on:
  - secondary, or extended, attributes
  - the "metadata" field of a primary metadata document, as shown in the next section.
- Other parameters available for searches are:
  - limit (number): limit number of search results
  - sort (string): sort columns for search results
  - counts: beyond the scope of this document
  - drilldown: beyond the scope of this document
Search Asset Type: metadata - name
You're not limited to searching within attributes (like the attribute search shown in the previous section). You can also search within the "metadata" section of a primary metadata document.
Search Asset Type: metadata - name - Request URL:
POST {service_URL}/v2/asset_types/{type_name}/search?catalog_id={catalog_id}
Search Asset Type: metadata - name - Request Body:
{
"query":"asset.name:Getting Started with Assets"
}
Notice that the query signifies that the search should take place in the "metadata" section of the primary metadata document by using the term asset at the beginning of the search path. Then the field to be searched within "metadata" is specified: name in the example above. The value to match is specified after the colon (:); in this case the value is Getting Started with Assets.
The following is the result of the above search:
Search Asset Type: metadata - name - Response Body:
{
"total_rows": 1,
"results": [
{
"metadata": {
"rov": {
"mode": 0,
"collaborator_ids": {}
},
"usage": {
"last_updated_at": "2019-04-30T17:27:56Z",
"last_updater_id": "IBMid___",
"last_update_time": 1556645276827,
"last_accessed_at": "2019-04-30T17:27:56Z",
"last_access_time": 1556645276827,
"last_accessor_id": "IBMid___",
"access_count": 0
},
"name": "Getting Started with Assets",
"description": "Describes how to create and use metadata for assets",
"tags": [
"getting",
"started",
"documentation"
],
"asset_type": "book",
"origin_country": "us",
"rating": 0,
"total_ratings": 0,
"catalog_id": "c6f3cbd8-___",
"created": 1556635077746,
"created_at": "2019-04-30T14:37:57Z",
"owner_id": "IBMid-___",
"size": 0,
"version": 0,
"asset_state": "available",
"asset_attributes": [
"book"
],
"asset_id": "3da5389d-d4a4-43da-be1f-___",
"asset_category": "USER"
},
"href": "https://api.dataplatform.cloud.ibm.com/v2/assets/3da5389d-d4a4-43da-be1f-___?catalog_id=c6f3cbd8-___"
}
]
}
In this case, the result is the same as was described in Search Asset Type: attribute - book - Response Body. See that section for more details.
Connections
A connection is the information necessary to create a connection to a data source or a repository. You create a connection asset by providing the connection information.
List data source types
Data sources are where data can be written or read and might include relational database systems, file systems, object storage systems and others.
To list supported data source types, call the following GET method:
GET /v2/datasource_types
The response to the GET method includes information about each of the sources and targets that are currently supported. The response includes a unique ID property value (metadata.asset_id), a name, and a label. Use the metadata.asset_id property value for the data source in other APIs that reference a data source type. Additional useful information, such as whether that data source can be used as a source or target (or both), is also included.
Use the connection_properties=true query parameter to return a set of properties for each data source type that is used to define a connection to it. Use the interaction_properties=true query parameter to return a set of properties for each data source type that is used to interact with a created connection. Interaction properties for a relational database might include the table name and schema from which to retrieve data.
Use the _sort query parameter to order the list of data source types returned in the response.
A default maximum of 100 data source type entries are returned per page of results. Use the _limit query parameter with an integer value to specify a lower limit.
More data source types than those on the first page of results might be available. Additional properties generated from the page size initially specified with _limit are returned in the response. Call a GET method using the value of the next.href property to retrieve the next page of results. Call a GET method using the value in the prev.href property to retrieve the previous page of results. Call a GET method using the value in the last.href property to retrieve the last page of results.
These URIs use the _offset and _limit query parameters to retrieve a specific block of data source types from the full list. Alternatively, you can use a combination of the _offset and _limit query parameters to retrieve a custom block of results.
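The paging parameters described above might be combined as in this Python sketch. The function names and the host https://cpd.example.com are illustrative assumptions; the query parameter names (_offset, _limit, connection_properties) and the next.href property are the ones named in this section.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def datasource_types_url(service_url, limit=100, offset=0, connection_properties=False):
    """Build a GET /v2/datasource_types URL for one block of results."""
    params = {"_offset": offset, "_limit": limit}
    if connection_properties:
        params["connection_properties"] = "true"
    return f"{service_url}/v2/datasource_types?{urlencode(params)}"


def next_page_url(page):
    """Return next.href from a parsed page of results, or None if absent."""
    return (page.get("next") or {}).get("href")


url = datasource_types_url(
    "https://cpd.example.com", limit=25, offset=50, connection_properties=True
)
query = parse_qs(urlsplit(url).query)  # decoded query parameters, for inspection
print(url)
```

A caller would typically request the first page, then keep following next_page_url until it returns None.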
Create a connection
Connections to any of the supported data source types returned by the previous method can be created and persisted in a catalog or project.
To create a connection, call the following POST method:
POST /v2/connections
A new connection can be created in a catalog or project. Use the catalog_id or project_id query parameter to specify where to create the connection asset. Either catalog_id or project_id is required.
The request body for the method is a UTF-8 encoded JSON document and includes the data source type ID (obtained in the List data source types section), its unique name in the catalog or project space, and a set of connection properties specific to the data source. Some connection properties are required.
The following example shows the request body used for creating a connection to IBM dashDB:
{
"datasource_type": "cfdcb449-1204-44ba-baa6-9a8a878e6aa7",
"name":"My-DashDB-Connection",
"properties": {
"host":"dashDBhost.com",
"port":"50001",
"database":"MYDASHDB",
"password": "mypassword",
"username": "myusername"
}
}
By default, the physical connection to the data source is tested when the connection is created. Use the test=false query parameter to disable the connection test.
A response payload containing a connection ID and other metadata is returned when a connection is successfully created. Use the connection ID as path parameter in other REST APIs when a connection resource must be referenced.
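The following Python sketch builds, but does not send, a create-connection request from the pieces described above. The function name, the host https://cpd.example.com, the token value, and the Content-Type header are assumptions for illustration; the path, the catalog_id, project_id, and test query parameters, and the request body come from this section.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def create_connection_request(service_url, token, body,
                              catalog_id=None, project_id=None, test=True):
    """Build (but do not send) the POST /v2/connections request."""
    if catalog_id is None and project_id is None:
        raise ValueError("catalog_id or project_id is required")
    params = {}
    if catalog_id is not None:
        params["catalog_id"] = catalog_id
    if project_id is not None:
        params["project_id"] = project_id
    if not test:
        params["test"] = "false"  # skip the physical connection test
    return Request(
        f"{service_url}/v2/connections?{urlencode(params)}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumed header value
        },
        method="POST",
    )


connection = {
    "datasource_type": "cfdcb449-1204-44ba-baa6-9a8a878e6aa7",
    "name": "My-DashDB-Connection",
    "properties": {
        "host": "dashDBhost.com",
        "port": "50001",
        "database": "MYDASHDB",
        "password": "mypassword",
        "username": "myusername",
    },
}
request = create_connection_request(
    "https://cpd.example.com", "token-value", connection, catalog_id="cat-1", test=False
)
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would send it; that step is left out here so the sketch runs without a cluster.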
Discover connection assets
Data sources contain data and metadata describing the data they contain.
To discover or browse the data or metadata in a data source, call the following GET method:
GET /v2/connections/{connection_id}/assets?path=
Use the catalog_id or project_id query parameter to specify where the connection asset was created. Either catalog_id or project_id is required.
connection_id is the ID of the connection asset returned from the POST https://{service_URL}/v2/connections method, which created the connection asset.
The path query parameter is required and is used to specify the hierarchical path of the asset within the data source to be browsed. In a relational database, for example, the path might represent a schema and table. For a file object, the path might represent a folder hierarchy.
Each asset in the assets array returned by this method includes a property containing its path in the hierarchy to facilitate the next call to drill down deeper in the hierarchy.
For example, starting at the root path in an RDBMS will return a list of schemas:
{
"path": "/",
"asset_types": [
{
"type": "schema",
"dataset": false,
"dataset_container": true
}
],
"assets": [
{
"id": "GOSALES",
"type": "schema",
"name": "GOSALES",
"path": "/GOSALES"
}
],
"fields": [],
"first": {
"href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=0&_limit=100"
},
"prev": {
"href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=0&_limit=100"
},
"next": {
"href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=100&_limit=100"
}
}
Drill down into the GOSALES schema using the path property for the GOSALES schema asset to discover the list of table assets in the schema.
GET /v2/connections/{connection_id}/assets?catalog_id={catalog_id}&path=/GOSALES
The list of table type assets is returned in the response.
{
"path": "/GOSALES",
"asset_types": [
{
"type": "table",
"dataset": true,
"dataset_container": false
}
],
"assets": [
{
"id": "BRANCH",
"type": "table",
"name": "BRANCH",
"description": "BRANCH contains address information for corporate offices and distribution centers.",
"path": "/GOSALES/BRANCH"
},
{
"id": "CONVERSION_RATE",
"type": "table",
"name": "CONVERSION_RATE",
"description": "CONVERSION_RATE contains currency exchange values.",
"path": "/GOSALES/CONVERSION_RATE"
}
],
"fields": [],
"first": {
"href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=0&_limit=100"
},
"prev": {
"href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=0&_limit=100"
},
"next": {
"href": "https://wdp-dataconnect-ys1dev.stage1.mybluemix.net/v2/connections/4b28b5c1-d818-4ad2-bcf9-7de08e776fde/assets?catalog_id=75a3062b-e40f-4bc4-9519-308ee1b5b251&_offset=100&_limit=100"
}
}
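The drill-down shown in the two responses above can be sketched as a small recursive walk. This is only an illustration: fetch is a caller-supplied function (hypothetical) that returns the parsed JSON for a given path, the canned RESPONSES dictionary stands in for real calls, and paging through next.href is ignored for brevity.

```python
def walk_assets(fetch, path="/"):
    """Yield paths of data set assets, drilling into container assets.

    `fetch(path)` is a caller-supplied function (hypothetical) returning the
    parsed response of GET /v2/connections/{connection_id}/assets for `path`.
    """
    page = fetch(path)
    kinds = {t["type"]: t for t in page.get("asset_types", [])}
    for asset in page.get("assets", []):
        kind = kinds.get(asset["type"], {})
        if kind.get("dataset"):
            yield asset["path"]  # data can be fetched for this asset
        elif kind.get("dataset_container"):
            yield from walk_assets(fetch, asset["path"])  # drill down


# Canned responses shaped like the examples above, standing in for real calls.
RESPONSES = {
    "/": {
        "asset_types": [{"type": "schema", "dataset": False, "dataset_container": True}],
        "assets": [{"id": "GOSALES", "type": "schema", "name": "GOSALES", "path": "/GOSALES"}],
    },
    "/GOSALES": {
        "asset_types": [{"type": "table", "dataset": True, "dataset_container": False}],
        "assets": [
            {"id": "BRANCH", "type": "table", "name": "BRANCH", "path": "/GOSALES/BRANCH"},
            {"id": "CONVERSION_RATE", "type": "table", "name": "CONVERSION_RATE",
             "path": "/GOSALES/CONVERSION_RATE"},
        ],
    },
}

tables = list(walk_assets(RESPONSES.__getitem__))
print(tables)
```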
Use the fetch query parameter with a value of either data, metadata, or both. Data can only be fetched for data set assets. In the response above, note that the asset type has a type property value of table and a dataset property value of true. This means that data can be fetched from table type assets. However, if you fetched assets from the connection root, the response would contain schema asset types, which are not data sets, so fetching data is not relevant for them.
A default maximum of 100 metadata assets are returned per page of results. Use the _limit query parameter with an integer value to specify a lower limit. More assets than those on the first page of results might be available.
Additional properties generated from the page size initially specified with _limit are returned in the response. Call a GET method using the value of the next.href property to retrieve the next page of results. Call a GET method using the value in the prev.href property to retrieve the previous page of results. Call a GET method using the value in the last.href property to retrieve the last page of results.
These URIs use the _offset and _limit query parameters to retrieve a specific block of assets from the full list. Alternatively, use a combination of the _offset and _limit query parameters to retrieve a custom block of results.
Specify properties for reading delimited files
When reading a delimited file using this method, specify property values to correctly parse the file based on its format. These properties are passed to the method as a JSON object using the properties query parameter. The default file format (property file_format) is a CSV file. If the file is a CSV, the following property values are set by default:
Property Name | Property Description | Default Value | Value Description |
---|---|---|---|
quote_character | quote character | double_quote | double quotation mark |
field_delimiter | field delimiter | comma | comma |
row_delimiter | row delimiter | carriage_return_linefeed | carriage return followed by line feed |
escape_character | escape character | double_quote | double quotation mark |
For CSV file formats, these property values cannot be overridden. If it is necessary to modify these properties to properly read a delimited file, set the file_format property to delimited. For generic delimited files, these properties have the following values:
Property Name | Property Description | Default Value | Value Description |
---|---|---|---|
quote_character | quote character | none | no character is used for a quote |
field_delimiter | field delimiter | null | no field delimiter value is set by default |
row_delimiter | row delimiter | new_line | any new line representation |
escape_character | escape character | none | no character is used for an escape |
This example sets file format properties for a generic delimited file:
GET https://{service_URL}/v2/connections/{connection_id}/assets?catalog_id={catalog_id}&path=/myFolder/myFile.txt&fetch=data&properties={"file_format":"delimited", "quote_character":"single_quote","field_delimiter":"colon","escape_character":"backslash"}
For more information about this method see the REST API Reference.
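Because the properties value is a JSON object carried in a query string, it generally needs to be URL-encoded before the request is sent. The sketch below shows one way to do that in Python; the function name, host, and IDs are placeholders, and whether the service also accepts the unencoded form shown above is not something this sketch assumes.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def delimited_fetch_url(service_url, connection_id, catalog_id, path, properties):
    """Build the assets URL with file-format properties as encoded JSON."""
    params = {
        "catalog_id": catalog_id,
        "path": path,
        "fetch": "data",
        "properties": json.dumps(properties, separators=(",", ":")),
    }
    return f"{service_url}/v2/connections/{connection_id}/assets?{urlencode(params)}"


properties = {
    "file_format": "delimited",
    "quote_character": "single_quote",
    "field_delimiter": "colon",
    "escape_character": "backslash",
}
url = delimited_fetch_url(
    "https://cpd.example.com", "conn-1", "cat-1", "/myFolder/myFile.txt", properties
)
# Decode the query string again to confirm the JSON survives the round trip.
decoded = json.loads(parse_qs(urlsplit(url).query)["properties"][0])
print(url)
```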
Discover assets using a transient connection
A data source's assets can be discovered without creating a persistent connection.
To browse assets without first creating a persistent connection, call the following POST method:
POST https://{service_URL}/v2/connections/assets?path=
This method is identical in behavior to the GET method in the Discover connection assets section except for two differences:
- You define the connection properties in the request body of the REST API. You do not reference the connection ID of a persistent connection with a query parameter. The same JSON object used to create a persistent connection is used in the request body.
- You do not specify a catalog or project ID with a query parameter.
See the previous section to learn how to set properties used to read delimited files.
For more information about this method see the REST API Reference.
Update a connection
To modify the properties of a connection, call the following PATCH method:
PATCH /v2/connections/{connection_id}
connection_id is the ID of the connection asset returned from the POST https://{service_URL}/v2/connections method, which created the connection asset.
Use the catalog_id or project_id query parameter to specify where the connection asset was created. Either catalog_id or project_id is required.
Set the Content-Type header to application/json-patch+json. The request body contains the connection properties to update, expressed as a JSON document in JSON Patch format.
Change the port number of the connection and add a description using this JSON Patch:
[
{
"op": "add",
"path": "/description",
"value": "My new PATCHed description"
},
{
"op":"replace",
"path":"/properties/port",
"value":"40001"
}
]
By default, the physical connection to the data source is tested when the connection is modified. Use the test=false query parameter to disable the connection test.
For more information about this method see the REST API Reference.
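To make the effect of the JSON Patch above concrete, this Python sketch applies the two operations to a local copy of a connection document. It is a deliberately small illustration, not the service's implementation: apply_simple_patch is a hypothetical helper that handles only "add" and "replace" on object members, without the array handling or escape rules of full JSON Patch.

```python
import copy
import json


def apply_simple_patch(document, patch):
    """Apply 'add'/'replace' operations to a copy of `document`.

    Small subset of JSON Patch for illustration: object members only,
    no arrays, no '~' escapes, no other operations.
    """
    result = copy.deepcopy(document)
    for op in patch:
        if op["op"] not in ("add", "replace"):
            raise ValueError(f"unsupported op: {op['op']}")
        *parents, leaf = op["path"].lstrip("/").split("/")
        target = result
        for key in parents:
            target = target[key]
        if op["op"] == "replace" and leaf not in target:
            raise KeyError(op["path"])  # 'replace' requires an existing member
        target[leaf] = op["value"]
    return result


connection = {
    "name": "My-DashDB-Connection",
    "properties": {"host": "dashDBhost.com", "port": "50001"},
}
patch = [
    {"op": "add", "path": "/description", "value": "My new PATCHed description"},
    {"op": "replace", "path": "/properties/port", "value": "40001"},
]
patched = apply_simple_patch(connection, patch)
body = json.dumps(patch)  # text to send as the PATCH request body
print(patched)
```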
Delete a connection
To delete a persistent connection, call the following DELETE method:
DELETE /v2/connections/{connection_id}
connection_id is the ID of the connection asset returned from the POST https://{service_URL}/v2/connections method, which created the connection asset.
Use the catalog_id or project_id query parameter to specify where the connection asset was created. Either catalog_id or project_id is required.
Business Lineage
Introduction
Business Lineage in WKC is designed to parse a new attribute, data_lineage, which is defined at the asset level in Common Assets Managed Services. This attribute is used as the primary API between components that provide data sources for Business Lineage and the WKC Lineage Service.
General structure of data_lineage attribute:
- List of assets.
- List of flows defined in lineage_relationships section, mapping source assets to target assets.
(Note: All assets referenced as source or target in the lineage_relationships section must be defined in the assets section, including their context assets, recursively.)
Recommendations for representing data flow:
- Describe data flow on the finest-grained level possible, for example on the level of database columns, data file attributes, or fields. If you don't have access to this level of metadata, describe data flow for table-level flows.
- Do not mix hierarchies. Map columns to fields and tables to stages. Do not map, for example, fields to databases.
- Add the data_lineage attribute to an asset that owns the logic from which lineage information is derived. For example: Database View, Job or Data Transformation.
Types of grouped flows:
- DESIGN: Flows are predicted based on default value parameterization. For some tools, default parameterization may not exist and DESIGN flows may not be applicable.
- OPERATIONAL: Flows were reported by operational metadata (OMD): 'what actually happened' for a particular run.
- USER: Flows are relayed by a user, not determined automatically.
- SYSTEM: Flows will happen in every run, independent of parameter values or external circumstances. For example, inner job flow.
- OTHER_IMPACT: Flows are not related to data flow. Other forms of impact flows are, for example, job sequencing or model dependencies.
An asset can be referenced in three ways:
- Asset Identity
{
"provider_name": "catalog:e96a4824-ab56-4bad-8176-4d98c313af3d",
"internal_id": "a3",
"lookup": {
"producer_properties": {
"id": "7a84b47f-1891-496c-b225-27602acd8128"
}
}
}
Note: The asset to which data_lineage is added can be referenced through the word self instead of the asset ID.
Note: internal_id is an internal identifier used for referencing the asset only within the data_lineage attribute, in the lineage_relationships section.
- Resource Key is a global identifier for CAMS assets. It can be used for referencing an asset for lineage flows. For example:
{
"provider_name": "catalog:e5887d45-6c6a-44b6-82ec-8cd953ff2765",
"internal_id": "a1",
"lookup": {
"producer_properties": {
"resource_key": "192.168.1.10/SAMPLE/DB2INST1/EMPLOYEE_TBL"
}
}
}
If a column or member of the asset needs to be referenced and it does not have its own ID, the resource key can be extended in the form resourceKey:Column_Name. For example:
{
"provider_name": "catalog:e5887d45-6c6a-44b6-82ec-8cd953ff2765",
"internal_id": "a1",
"lookup": {
"producer_properties": {
"resource_key": "192.168.1.10/SAMPLE/DB2INST1/EMPLOYEE_TBL:EMP_ID_COL"
}
}
}
- Context information
{
"provider_name": "catalog:e5887d45-6c6a-44b6-82ec-8cd953ff2765",
"internal_id": "a3",
"lookup": {
"fqrn": [
{
"database_column": "ID"
},
{
"database_table": "EMP"
},
{
"database_schema": "XYZ"
},
{
"database": "db2"
},
{
"host": "my.abc.com"
}
]
}
}
The following example shows a data_lineage attribute:
{
"name": "data_lineage",
"entity": {
"assets": [
{
"provider_name": "catalog:e5887d45-6c6a-44b6-82ec-8cd953ff2765",
"internal_id": "a1",
"lookup": {
"producer_properties": {
"resource_key": "192.168.1.10/SAMPLE/DB2INST1/EMP_ACT:EMENDATE"
}
}
},
{
"provider_name": "catalog:e5887d45-6c6a-44b6-82ec-8cd953ff2765",
"internal_id": "a2",
"lookup": {
"producer_properties": {
"resource_key": "192.168.1.10/SAMPLE/DB2INST1/EMPACT:EMENDATE"
}
}
}
],
"lineage_relationships": [
{
"flow_type": "flow_design",
"flows": [
{
"sources": [
"a1"
],
"targets": [
"a2"
]
}
]
}
]
}
}
Note: The provider_name field is optional. When it is not specified, the asset container to which this data_lineage attribute is added is used as the default.
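The rule stated earlier, that every asset referenced as a source or target must also be defined in the assets section, can be checked locally before the attribute is submitted. The Python sketch below is one possible check; undefined_flow_refs is a hypothetical helper, and it only compares internal_id values without validating the lookup contents or context assets.

```python
def undefined_flow_refs(data_lineage):
    """Return internal_ids used in flows but missing from the assets list."""
    entity = data_lineage.get("entity", {})
    defined = {asset["internal_id"] for asset in entity.get("assets", [])}
    missing = []
    for group in entity.get("lineage_relationships", []):
        for flow in group.get("flows", []):
            for ref in flow.get("sources", []) + flow.get("targets", []):
                if ref not in defined and ref not in missing:
                    missing.append(ref)
    return missing


# Trimmed version of the example attribute above (lookup details omitted).
attribute = {
    "name": "data_lineage",
    "entity": {
        "assets": [{"internal_id": "a1"}, {"internal_id": "a2"}],
        "lineage_relationships": [
            {
                "flow_type": "flow_design",
                "flows": [{"sources": ["a1"], "targets": ["a2"]}],
            }
        ],
    },
}
print(undefined_flow_refs(attribute))
```

An empty result means each flow endpoint has a matching entry in assets.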
Lineage
Introduction
The lineage of an asset includes information about all events, and other assets, that have led to its current state and its further usage. Asset and Event are the two main entities that are part of the lineage data model. An asset can either be generated from or used in subsequent events. An event can be any of:
- asset-generation-events
- asset-modification-events
- asset-usage-events.
Use the Lineage API to publish events on an asset or to query the lineage of an asset.
Publish a lineage event
The following example shows a sample lineage event that can be posted when a data set is published from a project to a catalog:
Request URL
POST /v2/lineage_events
Request Body
{
"message_version": "v1",
"user_id": "IAM-Id_of_User",
"account_id": "e86f2b06b0b267d559e7c387ceefb089",
"event_details": {
"event_id": "sample-event1",
"event_type": "DATASET_PUBLISHED",
"event_category": [
"additions"
],
"event_time": "2018-04-03T14:01:08.603Z",
"event_source_service": "Watson Knowledge Catalog"
},
"generates_assets": [
{
"id": "9f9c961a-78d1-4c06-a601-4b5890fdataset03",
"asset_type": "DataSet",
"relation": {
"name": "Created"
},
"properties": {
"dataset": {
"type": "dataset",
"value": {
"id": "9f9c961a-78d1-4c06-a601-4b5890fdataset03",
"name": "Asset Name in Catalog XX",
"catalog_id": "9f9c961a-78d1-4c06-a601-4b589catalog"
}
},
"catalog": {
"type": "catalog",
"value": {
"id": "9f9c961a-78d1-4c06-a601-4b589catalog"
}
}
}
}
],
"uses_assets": [
{
"id": "9f9c961a-78d1-4c06-a601-4b5890fdataset02",
"asset_type": "DataSet",
"relation": {
"name": "Used"
},
"properties": {
"dataset": {
"type": "dataset",
"value": {
"id": "9f9c961a-78d1-4c06-a601-4b5890fdataset02",
"name": "2017_sales_data",
"project_id": "9f9c961a-78d1-4c06-a601-4b589project"
}
},
"project": {
"type": "project",
"value": {
"id": "9f9c961a-78d1-4c06-a601-4b589project"
}
}
}
}
]
}
Response Body
{
"metadata": {
"id": "01014d1f-31cf-4956-bd41-7a77ba14004c",
"source_event_id": "sample-event1"
}
}
The id generated in the response can be used to query the details of the published event with the following request:
Request URL
GET /v2/lineage_events/01014d1f-31cf-4956-bd41-7a77ba14004c
For more details on each field in the lineage event JSON payload, refer to the Lineage Events section of API documentation.
Query lineage of an asset
The lineage of an asset involved in the sample event can be queried using the following request:
Request URL
GET /v2/asset_lineages/9f9c961a-78d1-4c06-a601-4b5890fdataset03
Response Body
{
"resources": [
{
"metadata": {
"id": "01014d1f-31cf-4956-bd41-7a77ba14004c",
"source_event_id": "sample-event1",
"created_at": "2018-04-03T14:01:08.603Z",
"created_by": "IAM-Id_of_User"
},
"entity": {
"type": "DATASET_PUBLISHED",
"generates_assets": [
{
"id": "9f9c961a-78d1-4c06-a601-4b5890fdataset03",
"type": "DataSet",
"relation": {
"name": "Created"
},
"properties": {
"catalog": {
"type": "catalog",
"value": {
"id": "9f9c961a-78d1-4c06-a601-4b589catalog"
}
},
"dataset": {
"type": "dataset",
"value": {
"id": "9f9c961a-78d1-4c06-a601-4b5890fdataset03",
"name": "Asset Name in Catalog XX",
"catalog_id": "9f9c961a-78d1-4c06-a601-4b589catalog"
}
}
}
}
],
"uses_assets": [
{
"id": "9f9c961a-78d1-4c06-a601-4b5890fdataset02",
"type": "DataSet",
"relation": {
"name": "Used"
},
"properties": {
"dataset": {
"type": "dataset",
"value": {
"id": "9f9c961a-78d1-4c06-a601-4b5890fdataset02",
"name": "2017_sales_data",
"project_id": "9f9c961a-78d1-4c06-a601-4b589project"
}
},
"project": {
"type": "project",
"value": {
"id": "9f9c961a-78d1-4c06-a601-4b589project"
}
}
}
}
],
"properties": {
"event_time": "2018-04-03T14:01:08.603Z",
"event_category": [
"additions"
],
"event_source_service": "Watson Knowledge Catalog"
}
}
}
],
"limit": 50,
"offset": 0,
"first": {
"href": "https://api.dataplatform.cloud.ibm.com/v2/asset_lineages/9f9c961a-78d1-4c06-a601-4b5890fdataset03?offset=0&_=1528182675331"
}
}
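A lineage response like the one above can be condensed into a list of which assets each event used and which it generated. The Python sketch below assumes only the response shape shown in the example ("resources", "entity.type", "uses_assets", "generates_assets"); the function name and the shortened sample IDs are illustrative.

```python
def summarize_lineage(lineage_response):
    """Return (event_type, used_ids, generated_ids) for each event in 'resources'."""
    summary = []
    for resource in lineage_response.get("resources", []):
        entity = resource.get("entity", {})
        used = [asset["id"] for asset in entity.get("uses_assets", [])]
        generated = [asset["id"] for asset in entity.get("generates_assets", [])]
        summary.append((entity.get("type"), used, generated))
    return summary


# Shortened sample with made-up IDs, shaped like the response above.
sample = {
    "resources": [
        {
            "metadata": {"id": "evt-1", "source_event_id": "sample-event1"},
            "entity": {
                "type": "DATASET_PUBLISHED",
                "generates_assets": [{"id": "dataset03", "type": "DataSet"}],
                "uses_assets": [{"id": "dataset02", "type": "DataSet"}],
            },
        }
    ],
    "limit": 50,
    "offset": 0,
}
summary = summarize_lineage(sample)
print(summary)
```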
Simple Query
The simple query can be invoked like this:
GET /v3/search?query='fred flintstone'&limit=100
With the simple query you can perform simple textual searches using the Lucene syntax. The above query will return items containing fred, flintstone, or both.
Advanced Query
You can use the Global Search API to issue queries using the full capabilities of the Elasticsearch Query Language to search for Catalog assets and Governance artifacts. For details on the structure of an item indexed in global search see below. The advanced query can look something like this:
POST /v3/search -d '
{
"_source":["provider_type_id", "artifact_id", "metadata.name"],
"query": {
"query_string" : { "query" : "flintstone" }
}
}'
The above query returns any items containing the string "flintstone".
Searching with authorization cache
Global Search searches across the Cloud Pak for Data platform and restricts search results to content that a user is authorized to view. For faster search results, you can use cached authorization information by setting the auth_cache parameter to true. The auth_cache parameter is set to false by default to use the most current authorization information.
A simple query using cached authorization information:
GET /v3/search?query='fred flintstone'&limit=100&auth_cache=true
An advanced query using cached authorization information:
POST /v3/search?auth_cache=true -d '
{
"_source":["provider_type_id", "artifact_id", "metadata.name"],
"query": {
"query_string" : { "query" : "flintstone" }
}
}'
Searching with limited authorization scope
Global Search searches across the Cloud Pak for Data platform. For a faster search, you can limit your search scope to certain platform components using the auth_scope parameter.
For example, to limit the scope of your search to assets within catalogs, you can use auth_scope=catalog, or to limit your search to assets within projects, you can use auth_scope=project.
Valid values for the auth_scope parameter are catalog, project, space, category and all (default).
A simple query limiting the scope of the search to catalog only:
GET /v3/search?query='fred flintstone'&limit=100&auth_scope=catalog
An advanced query limiting the scope of the search to catalogs only:
POST /v3/search?auth_scope=catalog -d '
{
"_source":["provider_type_id", "artifact_id", "metadata.name"],
"query": {
"query_string" : { "query" : "flintstone" }
}
}'
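The simple-query URL with the optional auth_cache and auth_scope parameters could be assembled as in this Python sketch. The function name and the host https://cpd.example.com are placeholders; the parameter names and the list of valid scopes are taken from the text above, and the query text is percent-encoded rather than wrapped in quotes.

```python
from urllib.parse import quote, urlencode

# Valid auth_scope values as listed above; "all" is the default.
AUTH_SCOPES = {"catalog", "project", "space", "category", "all"}


def simple_search_url(service_url, query, limit=100, auth_cache=False, auth_scope="all"):
    """Build a GET /v3/search URL for a simple query."""
    if auth_scope not in AUTH_SCOPES:
        raise ValueError(f"invalid auth_scope: {auth_scope!r}")
    params = {"query": query, "limit": limit}
    if auth_cache:
        params["auth_cache"] = "true"
    if auth_scope != "all":
        params["auth_scope"] = auth_scope
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return f"{service_url}/v3/search?{urlencode(params, quote_via=quote)}"


url = simple_search_url(
    "https://cpd.example.com", "fred flintstone", auth_cache=True, auth_scope="catalog"
)
print(url)
```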
Searching for terms in specific fields
The above query searched for the word flintstone anywhere in an indexed artifact. To search only specific fields instead of the whole document, specify them as in the following example:
{
"_source":["provider_type_id", "artifact_id", "metadata.name"],
"query": {
"match" : { "metadata.name" : "flintstone" }
}
}
In the above example, the query is searching for the term flintstone but only in the metadata.name field.
Key-Value Search
Use key-value pairs to restrict search within specific properties such as the name, description, tags, column names, terms, custom properties and more.
- key:value: This matches value in the key property.
- key:"value here": This matches value here in the key property. A quoted value is treated as a whole phrase.
- key1:value1 key2:value2 key3:value3: Multiple key-value pairs imply an AND between all pairs. The above matches (value1 in the key1 property) AND (value2 in the key2 property) AND (value3 in the key3 property).
- text before key1:value1 in-between key2:value2 key3:value3 after: Key-value pairs mixed with regular strings. The above matches (text OR before OR in-between OR after in any property) AND (value1 in the key1 property) AND (value2 in the key2 property) AND (value3 in the key3 property).
The following properties can be specified as the key.
name: Search within the name of an asset or artifact
desc: Search within the description of an asset or artifact
type: Search by the type of an asset (asset_type) or artifact (artifact_type)
owner: Search by the user ID of the owner of an asset
term: Search in assets and artifacts with the specified business term assigned
tag: Search in assets and artifacts with the specified tag
category: Search for artifacts with the specified primary category
category2: Search for artifacts with the specified secondary category
abbr: Search by the abbreviation of a business term
syn: Search by the synonym of a business term
classification: Search by the classification of an asset or artifact
column: Search with the name of a column in a data asset
columnDesc: Search within the description of a column in a data asset
columnTerm: Search with a business term assigned to a column in a data asset
columnTag: Search with a tag on a column in a data asset
columnDataclass: Search with a data class of a column in a data asset
connection: Search with a connection path of an asset
schema: Search for data assets with the specified schema name
table: Search for data assets with the specified table name
resourceKey: Search with a resource key of an asset
steward: Search by the user ID of the steward of an artifact
Global Search searchable custom attribute names can also be used as a key to restrict search to the specified custom attribute.
Sample query to restrict search to a few properties
{
"query":{
"bool":{
"must":[
{
"gs_user_query":{
"search_string":" name:job tag:Toronto",
"nlq_analyzer_enabled": true}
}
]
}
}
}
In this sample query, we want to query on any asset or artifact having job in the name property AND Toronto in the tags property.
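The key-value syntax above can be assembled programmatically. The following is a minimal sketch, not part of the API: the helper names build_search_string and build_query are hypothetical, and the quoting rule (wrap values that contain whitespace in double quotes) is an assumption based on the key:"value here" form described earlier. Values that themselves contain double quotes are not handled.

```python
def build_search_string(filters, free_text=""):
    """Combine optional free text and key:value pairs into one search string."""
    parts = [free_text.strip()] if free_text.strip() else []
    for key, value in filters.items():
        # Quote values containing whitespace so they are treated as one phrase.
        if any(ch.isspace() for ch in value):
            value = '"' + value + '"'
        parts.append(f"{key}:{value}")
    return " ".join(parts)


def build_query(filters, free_text=""):
    """Wrap the search string in the same body shape as the sample query."""
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "gs_user_query": {
                            "search_string": build_search_string(filters, free_text),
                            "nlq_analyzer_enabled": True,
                        }
                    }
                ]
            }
        }
    }


body = build_query({"name": "job", "tag": "Toronto"})
```

Passing the resulting dictionary through json.dumps should give a body equivalent to the sample query above.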
Sample query to restrict search to a custom attribute
{
"query":{
"bool":{
"must":[
{
"nested": {
"path": "custom_attributes",
"query": {
"gs_user_query":{
"search_string": "book.author.last_name:Smith",
"nlq_analyzer_enabled": true,
"nested": true
}
}
}
}
]
}
}
}
In this sample query, we want to query on assets of type book, having a Global Search searchable ("global_search_searchable") field called author.last_name with a value of Smith.
Sample Query with Sort
{
"_source":["provider_type_id", "artifact_id", "metadata.name"],
"query": {
"query_string" : {
"query" : "flintstone"
}
},
"sort": [
{"metadata.modified_on": {"order": "desc","unmapped_type": "date"}}
]
}
The above query will sort the search results based on the date the item was modified.
Sample Query with Aggregation
Here is a sample query that searches for the word flintstone with an aggregation (a count) of the words that people put in their tags fields and their terms fields. See below for the fields that exist in documents indexed in Global Search.
{
"query": {
"query_string" : {
"query" : "flintstone"
}
},
"aggregations" : {
"num_tags" : {"terms" : { "field" : "metadata.tags.keyword" }},
"num_terms" : {"terms" : { "field" : "metadata.terms.keyword" }}
}
}
Nested Queries and Custom Attributes
You can add any number of custom attributes to an item you index with Global Search. Each custom attribute consists of a name field and a value field.
"custom_attributes": [
{
"last_updated_at": 0,
"attribute_name": "string",
"attribute_value": "string"
}
]
Because custom attributes normally consist of two fields acting as one, they are nested objects and you must use nested queries to query on those nested objects.
The custom_attributes fields will be included in the response for any result which has custom_attributes. Nested queries are only required to query properties of the custom_attributes, such as custom_attributes.attribute_name or custom_attributes.attribute_value.
Sample Nested Query
In this sample query, we want to query on any asset having a custom attribute named city with a value of ottawa, and a second custom attribute named colour with a value of red. In this example, the city attribute is treated as a text field, while the colour attribute simulates an enumerated list of colours having exact values (red, blue, green, and so on).
{
"_source":["metadata.name", "custom_attributes"],
"query": {
"bool": {
"must": [
{
"nested": {
"path": "custom_attributes",
"query": {
"bool": {
"must": [
{
"bool": {
"must": [
{"term": {"custom_attributes.attribute_name": "city"}},
{"match": {"custom_attributes.attribute_value": "Ottawa"}}
]
}
}
]
}
}
}
},
{
"nested": {
"path": "custom_attributes",
"query": {
"bool": {
"must": [
{
"bool": {
"must": [
{"term": {"custom_attributes.attribute_name": "colour"}},
{"term": {"custom_attributes.attribute_value.keyword": "red"}}
]
}
}
]
}
}
}
}
]
}
},
"aggs": {
"custom_attr_count": {
"nested": {
"path": "custom_attributes"
},
"aggs": {
"city_count": {
"filter": {
"term": {"custom_attributes.attribute_name": "city"}
},
"aggs": {
"city_count": {
"terms": {
"field": "custom_attributes.attribute_value.keyword",
"size": 20
}
}
}
}
}
}
}
}
In the query body illustrated above, there is a query portion and an aggregations (aggs) portion. There can be any number of custom attributes. Because we only want counts of city, we must include a filter in the aggregation so that only attributes whose name is city are counted. Notice that the count is returned for custom_attributes.attribute_value.keyword, not custom_attributes.attribute_value. This is important: you cannot sort or aggregate on text fields, only on keyword fields. Every text field in Global Search has a corresponding keyword field with a .keyword extension. Use the .keyword field for things you want to count or sort on. Finally, the size parameter restricts the number of counts returned to the top 20.
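The repeated nested clauses in the sample lend themselves to a small helper. The sketch below is illustrative only: the function names custom_attribute_clause and value_counts_aggregation are hypothetical, and the clause omits the extra inner bool wrapper of the sample, which should be equivalent. It uses match on the text field for free-text values and term on the .keyword field for exact values, as described above.

```python
def custom_attribute_clause(name, value, exact=False):
    """Build one nested clause matching a custom attribute by name and value."""
    if exact:
        # Exact values (enumerations) are compared against the keyword field.
        value_clause = {"term": {"custom_attributes.attribute_value.keyword": value}}
    else:
        # Free-text values are matched against the analyzed text field.
        value_clause = {"match": {"custom_attributes.attribute_value": value}}
    return {
        "nested": {
            "path": "custom_attributes",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"custom_attributes.attribute_name": name}},
                        value_clause,
                    ]
                }
            },
        }
    }


def value_counts_aggregation(name, size=20):
    """Count the most frequent values of one custom attribute."""
    return {
        "custom_attr_count": {
            "nested": {"path": "custom_attributes"},
            "aggs": {
                f"{name}_count": {
                    # Only attributes with this name contribute to the counts.
                    "filter": {"term": {"custom_attributes.attribute_name": name}},
                    "aggs": {
                        f"{name}_values": {
                            "terms": {
                                "field": "custom_attributes.attribute_value.keyword",
                                "size": size,
                            }
                        }
                    },
                }
            },
        }
    }


body = {
    "_source": ["metadata.name", "custom_attributes"],
    "query": {
        "bool": {
            "must": [
                custom_attribute_clause("city", "Ottawa"),
                custom_attribute_clause("colour", "red", exact=True),
            ]
        }
    },
    "aggs": value_counts_aggregation("city"),
}
```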
General Purpose Search Function
Global Search provides a general purpose search function that is tailored to the requirements of Cloud Pak for Data users. You can invoke it using Global Search's Advanced API (see the Methods section below). Cloud Pak for Data uses this function when a user enters a search term in the search bar at the top of the Cloud Pak for Data user interface. You can invoke it anywhere you would normally invoke a standard Elasticsearch search function. For example, it can be the main function of your query:
{
"query":{
"gs_user_query":{
"search_string":"The quick red fox jumped over the lazy brown dog"
}
}
}
The search_string field is required and specifies the search query string.
The following optional parameters can be specified as part of the gs_user_query:
- search_fields - List of fields that the search will be restricted to. If not specified, the search will run across all fields in the configuration.
- nlq_analyzer_enabled - Specify true to enable the natural language analyzer. The default value is false.
- semantic_expansion_enabled - Specify true to enable semantic query expansion. The default value is false.
- nested - Specify true to optimize nested queries. The default value is false.
{
"query":{
"gs_user_query":{
"search_string": "The quick red fox jumped over the lazy brown dog",
"search_fields": ["metadata.name", "metadata.description"],
"nlq_analyzer_enabled": true,
"semantic_expansion_enabled": true,
"nested": false
}
}
}
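As a rough sketch of how a client might assemble this clause, the hypothetical helper below (gs_user_query_clause is not part of the API) enforces the required search_string and includes an optional parameter only when it differs from its documented default of false.

```python
def gs_user_query_clause(search_string, search_fields=None,
                         nlq_analyzer_enabled=False,
                         semantic_expansion_enabled=False,
                         nested=False):
    """Build a gs_user_query clause, omitting options left at their defaults."""
    if not search_string:
        raise ValueError("search_string is required")
    clause = {"search_string": search_string}
    if search_fields:
        clause["search_fields"] = list(search_fields)
    if nlq_analyzer_enabled:
        clause["nlq_analyzer_enabled"] = True
    if semantic_expansion_enabled:
        clause["semantic_expansion_enabled"] = True
    if nested:
        clause["nested"] = True
    return {"gs_user_query": clause}


body = {
    "query": gs_user_query_clause(
        "The quick red fox jumped over the lazy brown dog",
        search_fields=["metadata.name", "metadata.description"],
        nlq_analyzer_enabled=True,
    )
}
```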
This search function will find:
- a single phrase
- multiple individual words
- partial words (within words or at the beginning of words)
- the first letter of a word
If no search fields are specified, the function will search the entire document, including the name fields, the description field, tags, synonyms, custom attribute values, column names, and column descriptions, etc. It will give the highest priority to the name field of the document.
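To show how such a body could be submitted, here is a sketch that only prepares an HTTP request with the Python standard library and does not send it. The helper build_search_request is hypothetical, the host is a placeholder for https://{cpd_cluster_host}, and the /v3/search path is an assumption that should be checked against the Methods section. The Authorization header follows the Bearer format described in the introduction.

```python
import json
import urllib.request


def build_search_request(base_url, search_path, token, body):
    """Prepare (but do not send) a POST request carrying a search body."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + search_path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_search_request(
    "https://cpd.example.com",      # placeholder cluster host
    "/v3/search",                   # assumed path, confirm in the Methods section
    "<access_token_value_here>",    # token obtained from the authorize call
    {"query": {"gs_user_query": {"search_string": "fred flintstone"}}},
)
```

Sending the request would then be a call to urllib.request.urlopen(request), with TLS verification configured to suit the cluster's certificate.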
You can embed gs_user_query within a compound query:
{
"query":{
"bool":{
"must":[
{"gs_user_query":{"search_string": "the quick red fox jumped over the lazy brown dog"}}
],
"filter":[
{"term":{"provider_type_id":"cams"}}
]
}
},
"sort": [
{"metadata.modified_on": {"order": "desc","unmapped_type": "date"}}
]
}
You can include gs_user_query with complex queries that include aggregations along with sorts:
{
"query": {
"gs_user_query" : {
"search_string": "fred flintstone"
}
},
"sort" : [
{"metadata.modified_on": {"order": "desc", "unmapped_type": "date"}}
],
"aggregations": {
"first_letter": {
"terms": {
"script": "doc['metadata.name.keyword'].getValue().substring(0,1)",
"order": {
"_key": "asc"
}
},
"aggs": {
"first_letter_group": {
"terms": {
"field": "metadata.name.keyword",
"order": {
"_key": "asc"
}
}
}
}
}
}
}
You can use gs_user_query in a nested query to search for custom attributes:
{
"query": {
"bool": {
"should": [
{
"gs_user_query": {
"search_string": "quick red fox",
"nlq_analyzer_enabled": true
}
},
{
"nested": {
"path": "custom_attributes",
"query": {
"gs_user_query": {
"search_string": "lazy brown dog",
"nlq_analyzer_enabled": true,
"nested": true
}
}
}
}
]
}
}
}
Searching for a quoted phrase
Wrap the phrase in quotes within your query as follows:
{
"query":{
"gs_user_query":{
"search_string":"\"The quick red fox jumped over the lazy brown dog\""
}
}
}
The above query will search for exactly the phrase "The quick red fox jumped over the lazy brown dog".
A quoted phrase can also be included in a longer string:
{
"query":{
"gs_user_query":{
"search_string":"The \"quick red fox\" jumped over the \"lazy brown dog\"",
"nlq_analyzer_enabled": true
}
}
}
The above query will search for the phrases quick red fox and lazy brown dog, and will not return results containing only quick, red, fox, lazy, brown, or dog. The query will, however, return results matching the individual words jumped or over.
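When the body is built in code rather than written by hand, the backslash escaping of the inner quotes is produced by the JSON serializer. A minimal Python sketch:

```python
import json

# The phrase quotes are ordinary characters in the Python string.
search_string = 'The "quick red fox" jumped over the "lazy brown dog"'

body = {
    "query": {
        "gs_user_query": {
            "search_string": search_string,
            "nlq_analyzer_enabled": True,
        }
    }
}

# json.dumps escapes the inner quotes as \" in the serialized body.
encoded = json.dumps(body)
```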
Searching for words starting with ...
To search for words starting with a letter or letters, enter only the first 1 to 3 letters of the word.
{
"query":{
"gs_user_query":{
"search_string":"in"
}
}
}
The above query will return documents with words like infinite and invitation, but not words like definitive.
Searching for parts of words
If your search terms include more than three letters, then Global Search will search for any partial word matches. For example:
{
"query":{
"gs_user_query":{
"search_string":"init"
}
}
}
The above query will find documents with words like initialize (i.e. at the beginning of the word) and trinitrotoluene (i.e. within the word).
Note: The metadata.description and entity.assets.column_descriptions fields are excluded from partial word matching.
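The difference between the two behaviours can be pictured with a toy model. The sketch below is not how Global Search is implemented. It only mimics the documented rule that terms of up to three letters match the start of a word while longer terms match anywhere in a word, and it ignores field exclusions, scoring, and fuzzy matching. The function names are hypothetical.

```python
import re


def term_matches_word(term, word):
    """Toy rule: short terms match word prefixes, longer terms match substrings."""
    term, word = term.lower(), word.lower()
    if len(term) <= 3:
        return word.startswith(term)
    return term in word


def matching_words(term, text):
    """Return the words of text that the toy rule considers a match."""
    words = re.findall(r"[A-Za-z]+", text)
    return [word for word in words if term_matches_word(term, word)]


short_hits = matching_words("in", "infinite invitation definitive")
long_hits = matching_words("init", "initialize trinitrotoluene table")
```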
Searching with natural language
Natural language analysis can be applied to English search strings to optimize search results in the following ways:
- Words that are not important to the search intent are removed from the search query.
- Phrases in the search string that are common in English are automatically ranked higher than results for individual words.
{
"query": {
"gs_user_query": {
"search_string": "credit card interest in United States",
"nlq_analyzer_enabled": true
}
}
}
The above query will find documents with:
- Matches for the phrases credit card interest and United States ranked highest
- Matches for individual words credit, card, interest, United, and States ranked lower
The above query will not return documents containing only the word in.
Searching with business term semantics
Business terms can be used to express semantic meaning of words and phrases of a business vocabulary. Semantic search leverages this knowledge by expanding search to follow term relationships, making it easier to find semantically similar assets on the platform.
The query below illustrates how to structure a query which uses both natural language analysis and semantic expansion:
{
"query": {
"gs_user_query": {
"search_string": "contact methods for clients",
"nlq_analyzer_enabled": true,
"semantic_expansion_enabled": true
}
}
}
With both natural language analysis and semantic expansion enabled, the search will find business terms matching any of the following:
contact methods for clients
contact methods
clients
Business terms are matched based on their names or abbreviations. For example, CTM may be defined as an abbreviation of the term contact methods. If the search string includes CTM, the business term contact methods will match.
For any matching business terms, free-text search results will be expanded to also include the following business terms, as well as any assets associated with the following business terms:
- Synonyms of the business term
  - For example, customer may be defined as a synonym of client
- Business terms which are a type of the business term or its synonyms
  - For example, phone number with types home phone and cell phone may be defined as a type of contact methods
- Business terms which are related to the business term
  - For example, preferred method of contact may be defined as being related to contact method
The expanded search results are driven entirely by the definition of business terms and their relationships.
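The expansion rules can be illustrated with a toy glossary. This is only a model of the three relationship kinds listed above, using the example terms from this section. The data structure, the function expand_term, and the choice to expand a single level deep are assumptions, since the depth of expansion is not specified here.

```python
# Hypothetical in-memory glossary: term -> relationships to other terms.
GLOSSARY = {
    "client": {"synonyms": ["customer"], "types": [], "related": []},
    "contact methods": {
        "synonyms": [],
        "types": ["phone number"],
        "related": ["preferred method of contact"],
    },
    "phone number": {
        "synonyms": [],
        "types": ["home phone", "cell phone"],
        "related": [],
    },
}


def expand_term(term, glossary):
    """Return the term plus its synonyms, types (of it or its synonyms), and related terms."""
    entry = glossary.get(term, {})
    synonyms = entry.get("synonyms", [])
    expanded = [term] + synonyms
    for base in [term] + synonyms:
        expanded += glossary.get(base, {}).get("types", [])
    expanded += entry.get("related", [])
    # Remove duplicates while keeping the first occurrence order.
    return list(dict.fromkeys(expanded))


client_terms = expand_term("client", GLOSSARY)
contact_terms = expand_term("contact methods", GLOSSARY)
```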
Result scoring
Results are prioritized using a combination of field priority and type of match as follows:
Type of match:
- Matches for entire phrases will score highest.
- Exact matches of complete words will score next highest.
- If the search term is 3 characters or fewer, results that contain words starting with that search term will score next highest.
- Partial matches of complete words will score next highest.
- Fuzzy matches (i.e. adidas vs adadas) will match, but will score lowest.
Field:
- Name
- Synonyms, Abbreviation, Terms, Tags or Classifications
- Description, Primary or Secondary Category
- Column Descriptions, Column Terms, Column Tags or Column Data Class Names
Documents in Global Search
You can query on any of the fields within the document by writing the field name as a flattened (dot-separated) path through the JSON structure. For example, the field:
{
"entity":{
"artifacts":{
"artifact_id":"<id>"
}
}
}
is queried by using the following field name:
entity.artifacts.artifact_id
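The flattening convention is the usual dotted-path notation. As an illustration, the hypothetical helper below walks a nested dictionary and returns the dotted name of every leaf field. Lists are treated as leaves in this sketch.

```python
def flatten_fields(document, prefix=""):
    """Map each leaf of a nested dict to its dot-separated field name."""
    paths = {}
    for key, value in document.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into nested objects, extending the dotted prefix.
            paths.update(flatten_fields(value, path))
        else:
            paths[path] = value
    return paths


fields = flatten_fields({"entity": {"artifacts": {"artifact_id": "<id>"}}})
```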
Documents indexed in global search have the following structure:
{
"provider_type_id": "string",
"tenant_id": "string",
"artifact_id": "string",
"last_updated_at": 0,
"metadata": {
"name": "string",
"description": "string",
"artifact_type": "string",
"tags": [
"string"
],
"modified_on": "2021-02-11T11:25:59.384Z",
"modified_by": "string",
"terms": [
"string"
],
"term_global_ids": [
"string"
],
"steward_ids": [
"string"
],
"state": "string",
"classifications": [
"string"
],
"classification_global_ids": [
"string"
]
},
"entity": {
"artifacts": {
"global_id": "string",
"version_id": "string",
"artifact_id": "string",
"rule_type": "string",
"effective_start_date": "2021-02-11T11:25:59.384Z",
"effective_end_date": "2021-02-11T11:25:59.384Z",
"abbreviation": [
"string"
],
"synonyms": [
"string"
],
"synonym_global_ids": [
"string"
],
"enabled": true
},
"assets": {
"catalog_id": "string",
"project_id": "string",
"space_id": "string",
"column_names": [
"string"
],
"column_terms": [
"string"
],
"column_term_global_ids": [
"string"
],
"column_descriptions": [
"string"
],
"connection_paths": [
"string"
],
"column_tags": [
"string"
],
"connection_ids": [
"string"
],
"column_data_class_names": [
"string"
],
"resource_key": "string"
}
},
"custom_attributes": [
{
"last_updated_at": 0,
"attribute_name": "string",
"attribute_value": "string"
}
],
"categories": [
{
"last_updated_at": 0,
"primary_category_id": "string",
"primary_category_global_id": "string",
"primary_category_name": "string",
"secondary_category_ids": [
"string"
],
"secondary_category_global_ids": [
"string"
],
"secondary_category_names": [
"string"
]
}
]
}
copyright: years: 2019 lastupdated: "2019-02-01"
Methods
List all data quality rules or a subset of them
Get a list of data quality rules in the project.
GET /data_quality/v3/projects/{project_id}/rules
Request
Path Parameters
The identifier of the project to use.
Possible values: 1 ≤ length ≤ 128
Example:
b1ba1d22-71a7-4adf-99b2-3c8ba19497f5
Query Parameters
The start token of the resource from where the page should begin.
Possible values: 1 ≤ length ≤ 512
Example:
g1AAAAA-eJzLYWBgYMpgSmHgKy5JLCrJTq2MT8lPzkzJBYqzmxiYWJiZGYGkOWDSyBJZAPCBD58
The maximum number of resources to return.
Possible values: 1 ≤ value ≤ 200
Default:
200
Example:
20
Comma-separated list of data quality rule identifiers.
Possible values: 1 ≤ length ≤ 10000
Example:
b1ba1d22-71a7-4adf-99b2-3c8ba19497f5,b1ba1d22-71a7-4adf-99b2-3c8ba1949710
Response
A collection of data quality rules to be returned.
The maximum number of resources to return.
Possible values: 1 ≤ value ≤ 200
Example:
20
Total number of resources available.
Possible values: 0 ≤ value ≤ 9007199254740991
Example:
100
The link to a page in paginated collection.
A collection of data quality rules.
Possible values: 1 ≤ number of items ≤ 200
The link to a page in paginated collection.
The link to a page in paginated collection.
Status Code
Success.
Your authorization to access this method is missing, invalid, or expired.
You do not have permission to get the list of data quality rules in the specified project.
An error occurred. The list of data quality rules could not be returned.
{"total_count":100,"limit":50,"first":{"href":"https://cloud.ibm.com/data_quality/v3/projects/c19cde3a-5940-4c7a-ad0f-ee18f5f29c00/rules?limit=50"},"rules":[{"id":"7b3f3a79-6412-480b-a20c-393a3f7addbf","bound_expression":["TEST.table1.col1<=TEST.table2.col2"],"is_valid":true,"href":"https://cloud.ibm.com/data_quality/v3/projects/c19cde3a-5940-4c7a-ad0f-ee18f5f29c00/rules/7b3f3a79-6412-480b-a20c-393a3f7addbf","name":"table1.col1LessOrEqualTable2.col2","description":"The column TEST.table1.col1 has fewer or the same number of values as column TEST.table2.col2","sampling":{"size":2500,"interval":13,"sampling_type":"every_nth"},"output":{"columns":[{"variable_name":"col1","name":"out1","type":"rule_variable","disambiguator":1},{"name":"out2","type":"column","source_column":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col1","type":"column"}},{"expression":"col1-col2","name":"out4","type":"rule_expression","disambiguator":2},{"metric":"system_time","name":"out5","type":"metric"}],"database":{"records_type":"all_records","update_type":"append","location":{"connection":{"id":"7b3f3a79-6412-480b-a20c-393a3f7addbf"},"schema_name":"TEST","table_name":"output"}},"maximum_record_count":500},"input":{"definitions":[{"definition":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"disambiguator":1,"bindings":[{"variable_name":"col1","target":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col1","type":"column"}}]},{"definition":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"disambiguator":2,"bindings":[{"variable_name":"col2","target":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col2","type":"column"}}]}]},"joins":[{"type":"inner_join","left_data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"right_data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93xyz"},"left_column_name":"col1","right_column_name":"col2"}],"dimension":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"}}]}
{"trace":"cfkvi16tv9bimp2rqve666na0","status_code":404,"errors":[{"code":"not_found","message":"Requested resource was not found.","more_info":"https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=rules-creating-data","target":{"type":"field","name":"name"}}]}
{"trace":"cfkvi16tv9bimp2rqve666na0","status_code":404,"errors":[{"code":"not_found","message":"Requested resource was not found.","more_info":"https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=rules-creating-data","target":{"type":"field","name":"name"}}]}
{"trace":"cfkvi16tv9bimp2rqve666na0","status_code":404,"errors":[{"code":"not_found","message":"Requested resource was not found.","more_info":"https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=rules-creating-data","target":{"type":"field","name":"name"}}]}
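A client will usually reduce these payloads to the few fields it needs. The sketch below relies only on field names visible in the example responses above (total_count, limit, rules, id, name, is_valid, errors, code, message). The helper names are hypothetical, and real responses may carry additional fields.

```python
def summarize_rules(page):
    """Extract paging counters and (id, name, is_valid) for each rule in a page."""
    return {
        "total_count": page["total_count"],
        "limit": page["limit"],
        "rules": [
            (rule["id"], rule["name"], rule["is_valid"])
            for rule in page.get("rules", [])
        ],
    }


def error_messages(error_body):
    """Render each entry of an error response as 'code: message'."""
    return [
        f'{error["code"]}: {error["message"]}'
        for error in error_body.get("errors", [])
    ]


sample_page = {
    "total_count": 100,
    "limit": 50,
    "rules": [
        {
            "id": "7b3f3a79-6412-480b-a20c-393a3f7addbf",
            "name": "table1.col1LessOrEqualTable2.col2",
            "is_valid": True,
        }
    ],
}
summary = summarize_rules(sample_page)
```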
Create data quality rule
Create a data quality rule.
POST /data_quality/v3/projects/{project_id}/rules
Request
Path Parameters
The identifier of the project to use.
Possible values: 1 ≤ length ≤ 128
Example:
b1ba1d22-71a7-4adf-99b2-3c8ba19497f5
Data quality rule to create.
Example: {"name":"table1.col1LessOrEqualTable2.col2","description":"The column TEST.table1.col1 has fewer or the same number of values as column TEST.table2.col2","sampling":{"size":2500,"interval":13,"sampling_type":"every_nth"},"output":{"columns":[{"variable_name":"col1","name":"out1","type":"rule_variable","disambiguator":1},{"name":"out2","type":"column","source_column":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col1","type":"column"}},{"expression":"col1-col2","name":"out4","type":"rule_expression","disambiguator":2},{"metric":"system_time","name":"out5","type":"metric"}],"database":{"records_type":"all_records","update_type":"append","location":{"connection":{"id":"7b3f3a79-6412-480b-a20c-393a3f7addbf"},"schema_name":"TEST","table_name":"output"}},"maximum_record_count":500},"input":{"definitions":[{"definition":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"disambiguator":1,"bindings":[{"variable_name":"col1","target":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col1","type":"column"}}]},{"definition":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"disambiguator":2,"bindings":[{"variable_name":"col2","target":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col2","type":"column"}}]}]},"joins":[{"type":"inner_join","left_data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"right_data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93xyz"},"left_column_name":"col1","right_column_name":"col2"}],"dimension":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"}}
The name of the data quality rule. The rule name must be unique in the given project. If no unique name is provided, the quality rule will not be created.
Possible values: 1 ≤ length ≤ 200
Example:
address_exists_rule
Data quality rule input details.
The description of the data quality rule. If this property is omitted, no description is set.
Possible values: 1 ≤ length ≤ 5000
Example:
Rule to check address exists.
Identity of a data quality dimension resource. If this property is omitted, no data quality dimension is associated with the resource.
The output details of a data quality rule. If this property is omitted, no output records are saved.
The joins between data assets referenced in bindings and output. This property is not required if the rule is to be run on a single data asset. This property is also not required if a value for data_stage is provided.
Possible values: 1 ≤ number of items ≤ 50
The sampling options to be used during data quality rule run. If no sampling options are set, the rule is run against all the rows of the source.
Representation of the data stage flow resource to create. If this property is omitted, no subflow is created for the data quality rule.
Response
A data quality rule defines an executable applying a boolean expression on bound columns.
The name of the data quality rule. The name of the quality rule will be unique in a given project.
Possible values: 1 ≤ length ≤ 200
Example:
address_exists_rule
Data quality rule input details.
Resource identifier.
Possible values: 1 ≤ length ≤ 128
Example:
b1ba1d22-71a7-4adf-99b2-3c8ba19497f5
Flag indicating whether the rule is valid or not.
The location URL of a resource.
Possible values: 1 ≤ length ≤ 512
Example:
https://cloud.ibm.com/data_quality/v3/projects/c19cde3a-5940-4c7a-ad0f-ee18f5f29c00/definitions/7b3f3a79-6412-480b-a20c-393a3f7addbf
The description of the data quality rule. If this property is omitted, no description is set.
Possible values: 1 ≤ length ≤ 5000
Example:
Rule to check address exists.
Identity of a data quality dimension resource.
The output details of a data quality rule.
The joins between data assets referenced in bindings and output. This property is not required if the rule is to be run on a single data asset. This property is also not required if a data_stage element is provided.
Possible values: 1 ≤ number of items ≤ 50
The sampling options to be used during data quality rule run. If no sampling options are set, the rule is run against all the rows of the source.
Data stage flow details.
Status Code
Success.
Your authorization to access this method is missing, invalid, or expired.
You do not have permission to create data quality rules in the specified project.
An error occurred. The data quality rule could not be created.
{"id":"7b3f3a79-6412-480b-a20c-393a3f7addbf","bound_expression":["TEST.table1.col1<=TEST.table2.col2"],"is_valid":true,"href":"https://cloud.ibm.com/data_quality/v3/projects/c19cde3a-5940-4c7a-ad0f-ee18f5f29c00/rules/7b3f3a79-6412-480b-a20c-393a3f7addbf","name":"table1.col1LessOrEqualTable2.col2","description":"The column TEST.table1.col1 has fewer or the same number of values as column TEST.table2.col2","sampling":{"size":2500,"interval":13,"sampling_type":"every_nth"},"output":{"columns":[{"variable_name":"col1","name":"out1","type":"rule_variable","disambiguator":1},{"name":"out2","type":"column","source_column":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col1","type":"column"}},{"expression":"col1-col2","name":"out4","type":"rule_expression","disambiguator":2},{"metric":"system_time","name":"out5","type":"metric"}],"database":{"records_type":"all_records","update_type":"append","location":{"connection":{"id":"7b3f3a79-6412-480b-a20c-393a3f7addbf"},"schema_name":"TEST","table_name":"output"}},"maximum_record_count":500},"input":{"definitions":[{"definition":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"disambiguator":1,"bindings":[{"variable_name":"col1","target":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col1","type":"column"}}]},{"definition":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"disambiguator":2,"bindings":[{"variable_name":"col2","target":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col2","type":"column"}}]}]},"joins":[{"type":"inner_join","left_data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"right_data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93xyz"},"left_column_name":"col1","right_column_name":"col2"}],"dimension":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"}}
{"trace":"cfkvi16tv9bimp2rqve666na0","status_code":404,"errors":[{"code":"not_found","message":"Requested resource was not found.","more_info":"https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=rules-creating-data","target":{"type":"field","name":"name"}}]}
{"trace":"cfkvi16tv9bimp2rqve666na0","status_code":404,"errors":[{"code":"not_found","message":"Requested resource was not found.","more_info":"https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=rules-creating-data","target":{"type":"field","name":"name"}}]}
{"trace":"cfkvi16tv9bimp2rqve666na0","status_code":404,"errors":[{"code":"not_found","message":"Requested resource was not found.","more_info":"https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=rules-creating-data","target":{"type":"field","name":"name"}}]}
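The documented limits can be checked on the client before a rule is posted. This is a sketch under assumptions: check_rule_payload is a hypothetical helper, it covers only the name, description, and joins limits listed above, and the service remains the authority on validity (for example, uniqueness of the name within the project cannot be checked locally).

```python
def check_rule_payload(payload):
    """Return a list of problems found against the documented length limits."""
    problems = []
    name = payload.get("name", "")
    if not 1 <= len(name) <= 200:
        problems.append("name must be 1 to 200 characters long")
    description = payload.get("description")
    if description is not None and not 1 <= len(description) <= 5000:
        problems.append("description must be 1 to 5000 characters long")
    joins = payload.get("joins")
    if joins is not None and not 1 <= len(joins) <= 50:
        problems.append("joins must contain 1 to 50 items")
    return problems


problems = check_rule_payload(
    {"name": "address_exists_rule", "description": "Rule to check address exists."}
)
```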
Delete data quality rules
Delete the data quality rules for the given list of rule identifiers.
DELETE /data_quality/v3/projects/{project_id}/rules
Request
Path Parameters
The identifier of the project to use.
Possible values: 1 ≤ length ≤ 128
Example:
b1ba1d22-71a7-4adf-99b2-3c8ba19497f5
Query Parameters
Comma-separated list of data quality rule identifiers.
Possible values: 1 ≤ length ≤ 10000
Example:
b1ba1d22-71a7-4adf-99b2-3c8ba19497f5,b1ba1d22-71a7-4adf-99b2-3c8ba1949710
The option to delete related output tables when deleting data quality rules.
Default:
false
The option to cancel unfinished jobs before deleting or updating data quality rules.
Default:
false
Response
Status Code
Success.
Your authorization to access this method is missing, invalid, or expired.
You do not have permission to delete the data quality rules in the specified project.
An error occurred. The data quality rules cannot be deleted.
{"trace":"cfkvi16tv9bimp2rqve666na0","status_code":404,"errors":[{"code":"not_found","message":"Requested resource was not found.","more_info":"https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=rules-creating-data","target":{"type":"field","name":"name"}}]}
{"trace":"cfkvi16tv9bimp2rqve666na0","status_code":404,"errors":[{"code":"not_found","message":"Requested resource was not found.","more_info":"https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=rules-creating-data","target":{"type":"field","name":"name"}}]}
{"trace":"cfkvi16tv9bimp2rqve666na0","status_code":404,"errors":[{"code":"not_found","message":"Requested resource was not found.","more_info":"https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=rules-creating-data","target":{"type":"field","name":"name"}}]}
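Because the comma-separated identifier list is limited to 10000 characters, a long list of rules has to be split across several delete calls. The following sketch shows one way to batch the identifiers. The helper batch_rule_ids is hypothetical, and it assumes the limit applies to the raw, unencoded value.

```python
def batch_rule_ids(rule_ids, max_length=10000):
    """Group identifiers into comma-separated strings no longer than max_length."""
    batches = []
    current = ""
    for rule_id in rule_ids:
        if len(rule_id) > max_length:
            raise ValueError("identifier is longer than the allowed list length")
        candidate = f"{current},{rule_id}" if current else rule_id
        if len(candidate) > max_length:
            # Close the current batch and start a new one with this identifier.
            batches.append(current)
            current = rule_id
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


batches = batch_rule_ids([
    "b1ba1d22-71a7-4adf-99b2-3c8ba19497f5",
    "b1ba1d22-71a7-4adf-99b2-3c8ba1949710",
])
```

Each returned string can then be used as the value of the identifier list query parameter in one DELETE request.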
Validate data quality rule
Check the validity of the data quality rule.
POST /data_quality/v3/projects/{project_id}/validate_rule
Request
Path Parameters
The identifier of the project to use.
Possible values: 1 ≤ length ≤ 128
Example:
b1ba1d22-71a7-4adf-99b2-3c8ba19497f5
Data quality rule to validate.
Example: {"name":"table1.col1LessOrEqualTable2.col2","description":"The column TEST.table1.col1 has fewer or the same number of values as column TEST.table2.col2","sampling":{"size":2500,"interval":13,"sampling_type":"every_nth"},"output":{"columns":[{"variable_name":"col1","name":"out1","type":"rule_variable","disambiguator":1},{"name":"out2","type":"column","source_column":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col1","type":"column"}},{"expression":"col1-col2","name":"out4","type":"rule_expression","disambiguator":2},{"metric":"system_time","name":"out5","type":"metric"}],"database":{"records_type":"all_records","update_type":"append","location":{"connection":{"id":"7b3f3a79-6412-480b-a20c-393a3f7addbf"},"schema_name":"TEST","table_name":"output"}},"maximum_record_count":500},"input":{"definitions":[{"definition":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"disambiguator":1,"bindings":[{"variable_name":"col1","target":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col1","type":"column"}}]},{"definition":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"disambiguator":2,"bindings":[{"variable_name":"col2","target":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col2","type":"column"}}]}]},"joins":[{"type":"inner_join","left_data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"right_data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93xyz"},"left_column_name":"col1","right_column_name":"col2"}],"dimension":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"}}
The name of the data quality rule. The rule name must be unique in the given project. If no unique name is provided, the quality rule will not be created.
Possible values: 1 ≤ length ≤ 200
Example:
address_exists_rule
Data quality rule input details.
The description of the data quality rule. If this property is omitted, no description is set.
Possible values: 1 ≤ length ≤ 5000
Example:
Rule to check address exists.
Identity of a data quality dimension resource. If this property is omitted, no data quality dimension is associated with the resource.
The output details of a data quality rule. If this property is omitted, no output records are saved.
The joins between data assets referenced in bindings and output. This property is not required if the rule is to be run on a single data asset. This property is also not required if a value for data_stage is provided.
Possible values: 1 ≤ number of items ≤ 50
The sampling options to be used during data quality rule run. If no sampling options are set, the rule is run against all the rows of the source.
Representation of the data stage flow resource to create. If this property is omitted, no subflow is created for the data quality rule.
Response
A data quality rule defines an executable applying a boolean expression on bound columns.
The name of the data quality rule. The name of the quality rule will be unique in a given project.
Possible values: 1 ≤ length ≤ 200
Example:
address_exists_rule
Data quality rule input details.
Resource identifier.
Possible values: 1 ≤ length ≤ 128
Example:
b1ba1d22-71a7-4adf-99b2-3c8ba19497f5
Flag indicating whether the rule is valid or not.
The location URL of a resource.
Possible values: 1 ≤ length ≤ 512
Example:
https://cloud.ibm.com/data_quality/v3/projects/c19cde3a-5940-4c7a-ad0f-ee18f5f29c00/definitions/7b3f3a79-6412-480b-a20c-393a3f7addbf
The description of the data quality rule. If this property is omitted, no description is set.
Possible values: 1 ≤ length ≤ 5000
Example:
Rule to check address exists.
Identity of a data quality dimension resource.
The output details of a data quality rule.
The joins between data assets referenced in bindings and output. This property is not required if the rule is to be run on a single data asset. This property is also not required if a data_stage element is provided.
Possible values: 1 ≤ number of items ≤ 50
The sampling options to be used during data quality rule run. If no sampling options are set, the rule is run against all the rows of the source.
Data stage flow details.
Status Code
The data quality rule is valid.
Your authorization to access this method is missing, invalid, or expired.
You do not have permission to validate data quality rules in the specified project.
The data quality rule is invalid. See the error message for the cause.
{"id":"7b3f3a79-6412-480b-a20c-393a3f7addbf","bound_expression":["TEST.table1.col1<=TEST.table2.col2"],"is_valid":true,"href":"https://cloud.ibm.com/data_quality/v3/projects/c19cde3a-5940-4c7a-ad0f-ee18f5f29c00/rules/7b3f3a79-6412-480b-a20c-393a3f7addbf","name":"table1.col1LessOrEqualTable2.col2","description":"The column TEST.table1.col1 has fewer or the same number of values as column TEST.table2.col2","sampling":{"size":2500,"interval":13,"sampling_type":"every_nth"},"output":{"columns":[{"variable_name":"col1","name":"out1","type":"rule_variable","disambiguator":1},{"name":"out2","type":"column","source_column":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col1","type":"column"}},{"expression":"col1-col2","name":"out4","type":"rule_expression","disambiguator":2},{"metric":"system_time","name":"out5","type":"metric"}],"database":{"records_type":"all_records","update_type":"append","location":{"connection":{"id":"7b3f3a79-6412-480b-a20c-393a3f7addbf"},"schema_name":"TEST","table_name":"output"}},"maximum_record_count":500},"input":{"definitions":[{"definition":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"disambiguator":1,"bindings":[{"variable_name":"col1","target":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col1","type":"column"}}]},{"definition":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"disambiguator":2,"bindings":[{"variable_name":"col2","target":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col2","type":"column"}}]}]},"joins":[{"type":"inner_join","left_data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"right_data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93xyz"},"left_column_name":"col1","right_column_name":"col2"}],"dimension":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"}}
{"trace":"cfkvi16tv9bimp2rqve666na0","status_code":404,"errors":[{"code":"not_found","message":"Requested resource was not found.","more_info":"https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=rules-creating-data","target":{"type":"field","name":"name"}}]}
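The error example above has a regular shape: a trace value, a status_code, and an errors array whose entries carry code, message, more_info, and an optional target. The following is a minimal sketch of how a client might unpack such a body. The ApiError class and the parse_error_response function are hypothetical names introduced for illustration, and the field names are taken only from the example shown in this section.

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ApiError:
    """One entry of the "errors" array (hypothetical helper type)."""
    code: str
    message: str
    more_info: Optional[str]
    target_type: Optional[str]
    target_name: Optional[str]


def parse_error_response(body: str) -> Tuple[Optional[str], Optional[int], List[ApiError]]:
    """Return (trace, status_code, errors) extracted from an error body."""
    doc = json.loads(body)
    errors = []
    for item in doc.get("errors", []):
        # "target" may be absent; fall back to an empty mapping.
        target = item.get("target") or {}
        errors.append(ApiError(
            code=item.get("code", ""),
            message=item.get("message", ""),
            more_info=item.get("more_info"),
            target_type=target.get("type"),
            target_name=target.get("name"),
        ))
    return doc.get("trace"), doc.get("status_code"), errors


# The sample body is the error example from this section, shortened by
# dropping the more_info link.
sample = (
    '{"trace":"cfkvi16tv9bimp2rqve666na0","status_code":404,'
    '"errors":[{"code":"not_found",'
    '"message":"Requested resource was not found.",'
    '"target":{"type":"field","name":"name"}}]}'
)
trace, status, errors = parse_error_response(sample)
for err in errors:
    print(f"[{status}] {err.code}: {err.message} (trace {trace})")
```

Logging the trace value alongside the error code may help when a failed request has to be reported to an administrator.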
Get data quality rule
Gets the data quality rule with the given identifier.
GET /data_quality/v3/projects/{project_id}/rules/{id}
Request
Path Parameters
The identifier of the project to use.
Possible values: 1 ≤ length ≤ 128
Example:
b1ba1d22-71a7-4adf-99b2-3c8ba19497f5
The data quality rule identifier.
Possible values: 1 ≤ length ≤ 128
Example:
b1ba1d22-71a7-4adf-99b2-3c8ba19497f5
Response
A data quality rule defines an executable applying a boolean expression on bound columns.
The name of the data quality rule. The name is unique within a given project.
Possible values: 1 ≤ length ≤ 200
Example:
address_exists_rule
Data quality rule input details.
Resource identifier.
Possible values: 1 ≤ length ≤ 128
Example:
b1ba1d22-71a7-4adf-99b2-3c8ba19497f5
Flag indicating whether the rule is valid or not.
The location URL of a resource.
Possible values: 1 ≤ length ≤ 512
Example:
https://cloud.ibm.com/data_quality/v3/projects/c19cde3a-5940-4c7a-ad0f-ee18f5f29c00/definitions/7b3f3a79-6412-480b-a20c-393a3f7addbf
The description of the data quality rule. If this property is omitted, no description is set.
Possible values: 1 ≤ length ≤ 5000
Example:
Rule to check address exists.
Identity of a data quality dimension resource.
The output details of a data quality rule.
The joins between data assets referenced in bindings and output. This property is not required if the rule is to be run on a single data asset. This property is also not required if a data_stage element is provided.
Possible values: 1 ≤ number of items ≤ 50
The sampling options to be used during a data quality rule run. If no sampling options are set, the rule is run against all the rows of the source.
Data stage flow details.
Status Code
Success.
Your authorization to access this method is missing, invalid, or expired.
You do not have permission to get the data quality rule with the given identifier from the specified project.
The data quality rule cannot be found.
An error occurred. The data quality rule with the given identifier cannot be returned.
{"id":"7b3f3a79-6412-480b-a20c-393a3f7addbf","bound_expression":["TEST.table1.col1<=TEST.table2.col2"],"is_valid":true,"href":"https://cloud.ibm.com/data_quality/v3/projects/c19cde3a-5940-4c7a-ad0f-ee18f5f29c00/rules/7b3f3a79-6412-480b-a20c-393a3f7addbf","name":"table1.col1LessOrEqualTable2.col2","description":"The column TEST.table1.col1 has fewer or the same number of values as column TEST.table2.col2","sampling":{"size":2500,"interval":13,"sampling_type":"every_nth"},"output":{"columns":[{"variable_name":"col1","name":"out1","type":"rule_variable","disambiguator":1},{"name":"out2","type":"column","source_column":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col1","type":"column"}},{"expression":"col1-col2","name":"out4","type":"rule_expression","disambiguator":2},{"metric":"system_time","name":"out5","type":"metric"}],"database":{"records_type":"all_records","update_type":"append","location":{"connection":{"id":"7b3f3a79-6412-480b-a20c-393a3f7addbf"},"schema_name":"TEST","table_name":"output"}},"maximum_record_count":500},"input":{"definitions":[{"definition":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"disambiguator":1,"bindings":[{"variable_name":"col1","target":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col1","type":"column"}}]},{"definition":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"disambiguator":2,"bindings":[{"variable_name":"col2","target":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col2","type":"column"}}]}]},"joins":[{"type":"inner_join","left_data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"right_data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93xyz"},"left_column_name":"col1","right_column_name":"col2"}],"dimension":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"}}
{"trace":"cfkvi16tv9bimp2rqve666na0","status_code":404,"errors":[{"code":"not_found","message":"Requested resource was not found.","more_info":"https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=rules-creating-data","target":{"type":"field","name":"name"}}]}
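As an illustration of the GET call described in this section, the sketch below builds the request with the Python standard library and checks the documented length limit (1 to 128 characters) on both path parameters. The function name build_get_rule_request, the host cpd.example.com, and the token value are placeholders, not part of the API. The request is only constructed here; sending it is left as a comment.

```python
import urllib.request
from urllib.parse import quote


def _check_id(label: str, value: str) -> str:
    """Enforce the documented constraint 1 <= length <= 128."""
    if not 1 <= len(value) <= 128:
        raise ValueError(f"{label} must be 1 to 128 characters long")
    return quote(value, safe="")


def build_get_rule_request(host: str, project_id: str, rule_id: str,
                           token: str) -> urllib.request.Request:
    """Build GET /data_quality/v3/projects/{project_id}/rules/{id}."""
    url = (
        f"https://{host}/data_quality/v3/projects/"
        f"{_check_id('project_id', project_id)}/rules/"
        f"{_check_id('id', rule_id)}"
    )
    headers = {
        "Authorization": f"Bearer {token}",  # bearer token from /icp4d-api/v1/authorize
        "Accept": "application/json",
    }
    return urllib.request.Request(url, method="GET", headers=headers)


request = build_get_rule_request(
    "cpd.example.com",                        # placeholder cluster host
    "b1ba1d22-71a7-4adf-99b2-3c8ba19497f5",   # example project identifier
    "7b3f3a79-6412-480b-a20c-393a3f7addbf",   # example rule identifier
    "<access_token_value_here>",              # placeholder token
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request) and json.load() the response.
```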
Delete data quality rule
Deletes the data quality rule with the given identifier.
DELETE /data_quality/v3/projects/{project_id}/rules/{id}
Request
Path Parameters
The identifier of the project to use.
Possible values: 1 ≤ length ≤ 128
Example:
b1ba1d22-71a7-4adf-99b2-3c8ba19497f5
The data quality rule identifier.
Possible values: 1 ≤ length ≤ 128
Example:
b1ba1d22-71a7-4adf-99b2-3c8ba19497f5
Query Parameters
The option to delete related output tables when deleting data quality rules.
Default:
false
The option to cancel unfinished jobs before deleting or updating data quality rules.
Default:
false
Response
Status Code
Success.
Your authorization to access this method is missing, invalid, or expired.
You do not have permission to delete the data quality rule from the specified project.
The data quality rule cannot be found.
An error occurred. The data quality rule cannot be deleted.
{"trace":"cfkvi16tv9bimp2rqve666na0","status_code":404,"errors":[{"code":"not_found","message":"Requested resource was not found.","more_info":"https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=rules-creating-data","target":{"type":"field","name":"name"}}]}
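A DELETE request for a rule could be assembled along the following lines. This is a sketch under stated assumptions: build_delete_rule_request is a hypothetical helper, the host and token are placeholders, and the two boolean query parameters are passed in by the caller because their names are not spelled out in this section. Encoding booleans as lowercase true and false is also an assumption.

```python
import urllib.request
from typing import Mapping, Optional
from urllib.parse import quote, urlencode


def build_delete_rule_request(host: str, project_id: str, rule_id: str,
                              token: str,
                              query: Optional[Mapping[str, bool]] = None
                              ) -> urllib.request.Request:
    """Build DELETE /data_quality/v3/projects/{project_id}/rules/{id}."""
    for label, value in (("project_id", project_id), ("id", rule_id)):
        # Documented constraint on both path parameters.
        if not 1 <= len(value) <= 128:
            raise ValueError(f"{label} must be 1 to 128 characters long")
    url = (
        f"https://{host}/data_quality/v3/projects/"
        f"{quote(project_id, safe='')}/rules/{quote(rule_id, safe='')}"
    )
    # Assumption: boolean flags are sent as lowercase "true" / "false".
    flags = {name: "true" if value else "false"
             for name, value in (query or {}).items()}
    if flags:
        url = f"{url}?{urlencode(flags)}"
    return urllib.request.Request(
        url, method="DELETE",
        headers={"Authorization": f"Bearer {token}"},
    )


# Without query parameters both options keep their default (false).
request = build_delete_rule_request(
    "cpd.example.com",                        # placeholder cluster host
    "b1ba1d22-71a7-4adf-99b2-3c8ba19497f5",
    "7b3f3a79-6412-480b-a20c-393a3f7addbf",
    "<access_token_value_here>",              # placeholder token
)
print(request.get_method(), request.full_url)
```

Because both options default to false, omitting the query string leaves output tables in place and does not cancel unfinished jobs.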
Update data quality rule
Updates a data quality rule as specified in the payload details of the update rule request. The updates must be specified by using the JSON patch format, described in RFC 6902.
The following attributes can be patched:
- name (value can only be replaced)
- definition (value can only be replaced)
- description (value can be added, removed, or replaced)
- dimension (value can be added, removed, or replaced)
- input (value can be added, removed, or replaced)
- output (value can be added, removed, or replaced)
- joins (value can be added, removed, or replaced)
- sampling (value can be added, removed, or replaced)
- data_stage/propagate_all_incoming_columns (value can be added, removed, or replaced)
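The list above can be turned into a client-side pre-check before a patch is sent. The sketch below is one possible reading, not the service's own validation: PATCHABLE and check_patch are hypothetical names, a pointer below a listed attribute (for example /output/maximum_record_count) is assumed to follow the rule of that attribute, and any operation other than those named in the list is reported. The server remains the authority on what is accepted.

```python
from typing import Dict, List, Mapping, Sequence, Set

REPLACE_ONLY: Set[str] = {"replace"}
ADD_REMOVE_REPLACE: Set[str] = {"add", "remove", "replace"}

# JSON pointer prefix -> operations named for it in the list above.
PATCHABLE: Dict[str, Set[str]] = {
    "/name": REPLACE_ONLY,
    "/definition": REPLACE_ONLY,
    "/description": ADD_REMOVE_REPLACE,
    "/dimension": ADD_REMOVE_REPLACE,
    "/input": ADD_REMOVE_REPLACE,
    "/output": ADD_REMOVE_REPLACE,
    "/joins": ADD_REMOVE_REPLACE,
    "/sampling": ADD_REMOVE_REPLACE,
    "/data_stage/propagate_all_incoming_columns": ADD_REMOVE_REPLACE,
}


def check_patch(operations: Sequence[Mapping[str, object]]) -> List[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for index, operation in enumerate(operations):
        path = str(operation.get("path", ""))
        verb = str(operation.get("op", ""))
        allowed = None
        for prefix, verbs in PATCHABLE.items():
            # Assumption: nested pointers inherit the rule of their attribute.
            if path == prefix or path.startswith(prefix + "/"):
                allowed = verbs
                break
        if allowed is None:
            problems.append(f"operation {index}: {path!r} is not a listed attribute")
        elif verb not in allowed:
            problems.append(f"operation {index}: {verb!r} is not listed for {path!r}")
    return problems


patch = [
    {"op": "replace", "path": "/description", "value": "Rule to check address exists."},
    {"op": "remove", "path": "/name"},
]
for problem in check_patch(patch):
    print(problem)
```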
PATCH /data_quality/v3/projects/{project_id}/rules/{id}
Request
Path Parameters
The identifier of the project to use.
Possible values: 1 ≤ length ≤ 128
Example:
b1ba1d22-71a7-4adf-99b2-3c8ba19497f5
The data quality rule identifier.
Possible values: 1 ≤ length ≤ 128
Example:
b1ba1d22-71a7-4adf-99b2-3c8ba19497f5
Query Parameters
The option to cancel unfinished jobs before deleting or updating data quality rules.
Default:
false
The updates to make in the data quality rule.
Example: [{"op":"replace","path":"/description","value":"Column col1 has fewer or the same number of values as column col2"}]
The operation to be performed
Allowable values: [add, remove, replace, move, copy, test]
A JSON pointer to the field to update
A string containing a JSON pointer value
value
Response
A data quality rule defines an executable applying a boolean expression on bound columns.
The name of the data quality rule. The name is unique within a given project.
Possible values: 1 ≤ length ≤ 200
Example:
address_exists_rule
Data quality rule input details.
Resource identifier.
Possible values: 1 ≤ length ≤ 128
Example:
b1ba1d22-71a7-4adf-99b2-3c8ba19497f5
Flag indicating whether the rule is valid or not.
The location URL of a resource.
Possible values: 1 ≤ length ≤ 512
Example:
https://cloud.ibm.com/data_quality/v3/projects/c19cde3a-5940-4c7a-ad0f-ee18f5f29c00/definitions/7b3f3a79-6412-480b-a20c-393a3f7addbf
The description of the data quality rule. If this property is omitted, no description is set.
Possible values: 1 ≤ length ≤ 5000
Example:
Rule to check address exists.
Identity of a data quality dimension resource.
The output details of a data quality rule.
The joins between data assets referenced in bindings and output. This property is not required if the rule is to be run on a single data asset. This property is also not required if a data_stage element is provided.
Possible values: 1 ≤ number of items ≤ 50
The sampling options to be used during a data quality rule run. If no sampling options are set, the rule is run against all the rows of the source.
Data stage flow details.
Status Code
Success.
Your authorization to access this method is missing, invalid, or expired.
You do not have permission to update the data quality rule in the specified project.
The data quality rule cannot be found.
An error occurred. The data quality rule was not updated.
{"id":"7b3f3a79-6412-480b-a20c-393a3f7addbf","bound_expression":["TEST.table1.col1<=TEST.table2.col2"],"is_valid":true,"href":"https://cloud.ibm.com/data_quality/v3/projects/c19cde3a-5940-4c7a-ad0f-ee18f5f29c00/rules/7b3f3a79-6412-480b-a20c-393a3f7addbf","name":"table1.col1LessOrEqualTable2.col2","description":"The column TEST.table1.col1 has fewer or the same number of values as column TEST.table2.col2","sampling":{"size":2500,"interval":13,"sampling_type":"every_nth"},"output":{"columns":[{"variable_name":"col1","name":"out1","type":"rule_variable","disambiguator":1},{"name":"out2","type":"column","source_column":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col1","type":"column"}},{"expression":"col1-col2","name":"out4","type":"rule_expression","disambiguator":2},{"metric":"system_time","name":"out5","type":"metric"}],"database":{"records_type":"all_records","update_type":"append","location":{"connection":{"id":"7b3f3a79-6412-480b-a20c-393a3f7addbf"},"schema_name":"TEST","table_name":"output"}},"maximum_record_count":500},"input":{"definitions":[{"definition":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"disambiguator":1,"bindings":[{"variable_name":"col1","target":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col1","type":"column"}}]},{"definition":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"disambiguator":2,"bindings":[{"variable_name":"col2","target":{"data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"column_name":"col2","type":"column"}}]}]},"joins":[{"type":"inner_join","left_data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"},"right_data_asset":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93xyz"},"left_column_name":"col1","right_column_name":"col2"}],"dimension":{"id":"ec453723-669c-48bb-82c1-11b69b3b8c93abc"}}
{"trace":"cfkvi16tv9bimp2rqve666na0","status_code":404,"errors":[{"code":"not_found","message":"Requested resource was not found.","more_info":"https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=rules-creating-data","target":{"type":"field","name":"name"}}]}
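To illustrate the update call, the sketch below serializes a JSON patch document and attaches it to a PATCH request. The function name build_patch_rule_request, the host, and the token are placeholders. The Content-Type value application/json-patch+json is the media type registered by RFC 6902; whether this API requires it rather than application/json is an assumption, so it is exposed as a parameter.

```python
import json
import urllib.request
from typing import Mapping, Sequence
from urllib.parse import quote


def build_patch_rule_request(host: str, project_id: str, rule_id: str,
                             token: str,
                             operations: Sequence[Mapping[str, object]],
                             content_type: str = "application/json-patch+json"
                             ) -> urllib.request.Request:
    """Build PATCH /data_quality/v3/projects/{project_id}/rules/{id}."""
    for label, value in (("project_id", project_id), ("id", rule_id)):
        # Documented constraint on both path parameters.
        if not 1 <= len(value) <= 128:
            raise ValueError(f"{label} must be 1 to 128 characters long")
    url = (
        f"https://{host}/data_quality/v3/projects/"
        f"{quote(project_id, safe='')}/rules/{quote(rule_id, safe='')}"
    )
    body = json.dumps(list(operations)).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": content_type,   # assumption, see the note above
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=body, method="PATCH", headers=headers)


# The patch document is the example given for the request body.
operations = [{
    "op": "replace",
    "path": "/description",
    "value": "Column col1 has fewer or the same number of values as column col2",
}]
request = build_patch_rule_request(
    "cpd.example.com",                        # placeholder cluster host
    "b1ba1d22-71a7-4adf-99b2-3c8ba19497f5",
    "7b3f3a79-6412-480b-a20c-393a3f7addbf",
    "<access_token_value_here>",              # placeholder token
    operations,
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```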