DataPrime examples
DataPrime can be used to query and transform log data in IBM Cloud Logs. This topic is intended to let you get started quickly with DataPrime by providing examples of various types of queries and data transformation.
For a detailed reference of DataPrime operators and expressions, see the DataPrime reference.
DataPrime supports data and helper types, expressions, and operators.
Data and helper types
DataPrime supports the following data types:
- boolean: true or false
- number/num: 42
- string: "foo"
- timestamp: 10-10-2021:11:11
DataPrime also supports several helper types, used as arguments for functions or operators, including:
- Interval: duration of time
- Regexp: standard regular expressions
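The following is a minimal sketch of these types in a query. The status, retries, and hostname fields are hypothetical; the cast and regular-expression syntax match the examples later in this topic.
# Hypothetical fields: a string comparison, a cast to a number, and a regexp helper
source logs
| filter status == 'active' && retries:number > 3
| redact hostname matching /[0-9]+/ to 'X'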
Expressions
Expressions consist of literals, functions, methods, operators, cast, and groupings that return data.
- literals: 42
- functions: length($.foo)
- methods: $d.foo.length() (an alternative syntax for functions)
- operator expressions: 1 + $d.foo
- cast: $d.foo:string
- groupings: (…)
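As a rough sketch, a single filter can combine several of these expression kinds. The query below, which uses the sample fields from the examples later in this topic, mixes a grouping, an operator expression, literals, a method, and a function; casts such as version:number appear in the convert examples.
source logs
| filter (duration_ms / 1000 > 2) && region.startsWith('eu-') && length(az) > 2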
Operators
Use operators as commands that transform unstructured streams of JSON data and create a new stream for a specified time period. Examples include filter, extract, sortby, groupby, orderby, find, and choose.
For more information about DataPrime operators, see the DataPrime reference.
Formatting queries
Queries are formatted as follows:
source logs | operator1 ... | operator2 ... | operator3 | ...
Any whitespace between operators is ignored. Queries can be written on multiple lines:
source logs
| operator1 ....
| operator2 ....
| operator3 ....
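For example, a hedged concrete version of this skeleton, using the sample fields introduced in the next section:
source logs
| filter result == 'success'
| countby region into request_count
| orderby request_count desc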
Examples of frequently used operators
The examples in this section use the following log records as input.
{ region: "us-east-1", az: "us-east-1a", duration_ms: 231, result: 'success' }
{ region: "us-east-1", az: "us-east-1b", duration_ms: 2222, result: 'failure' }
{ region: "eu-west-1", az: "eu-west-1a", duration_ms: 501, result: 'success' }
{ region: "eu-west-2", az: "eu-west-2a", duration_ms: 23, result: 'success' }
Filter examples
Get only the logs which have a result of success:
source logs
| filter result == 'success'
# Result - Full raw logs
Get only the logs which have a result of failure in the eu-west-1 region:
# Option 1
source logs
| filter result == 'failure' && region == 'eu-west-1'
# Option 2
source logs
| filter result == 'failure'
| filter region == 'eu-west-1'
# Result - Full raw logs
Get only the logs which have a region that starts with eu-:
source logs
| filter region.startsWith('eu-')
# Result - Full raw logs
Get only the logs which have a duration greater than 2 seconds:
source logs
| filter duration_ms / 1000 > 2
# Result - Full raw logs
Get all the logs, except the ones which are in the ap-southeast-1 region:
# Option 1
source logs
| filter region != 'ap-southeast-1'
# Option 2
source logs
| filter !(region == 'ap-southeast-1')
# Option 3
source logs
| block region == 'ap-southeast-1'
# Result - Full raw logs
Get 10 success logs which have a duration greater than 2 seconds:
source logs
| filter result == 'success' && duration_ms / 1000 > 2
| limit 10
# Result - Full raw logs
Order the success logs by descending duration, returning the 10 logs with the highest duration:
source logs
| filter result == 'success'
| orderby duration_ms desc
| limit 10
# Result - Full raw logs
Do a free-text search on the msg field, returning only the logs that contain the word started. Combine the free-text search with a filter on the stream field equal to stdout:
# Query 1 - Using the find operator for finding the text in msg, and then filtering using the filter operator
source logs
| find 'started' in msg
| filter stream == 'stdout'
# Results 1 - Full raw logs
# Query 2 - Using the ~ predicate and combining the free-text search and the filter into one expression that is passed to filter
source logs
| filter msg ~ 'started' && stream == 'stdout'
# Result - Full raw logs
Do a wildcard search for text anywhere in the log:
# Option 1 - Using wildfind operator
source logs
| wildfind 'mycompany'
# Option 2 - Inside an expression
source logs
| filter $d ~~ 'mycompany'
# Result - Full raw logs
Use a full Lucene query to filter results:
source logs
| lucene 'region:"us-east-1" AND "mycompany"'
Convert the data type of a key. Then get the logs whose version field contains a value greater than 32.
Consider this input data:
{ "version" : "12", ... }
{ "version": "17", ... }
{ "version": "65", ... }
# Option 1 - By casting
source logs
| filter version:number > 32
# Option 2 - Using convert operator
source logs
| convert version:number
| filter version > 32
Get the logs with a result of success, but select only the result and duration fields for the output:
# Option 1 - Using the choose operator
source logs
| filter result == 'success'
| choose result, duration_ms
# Option 2 - Using the select operator, which is just an alias for choose
source logs
| filter result == 'success'
| select result, duration_ms
# Result - Only result and duration keys will remain in each event
Choose examples
Get the success logs, but choose only the result and duration fields for the output:
# Option 1 - Using the choose operator
source logs
| filter result == 'success'
| choose result, duration_ms
# Result - Only result and duration keys will remain in each event
Construct a new object using choose. The output fields are:
- outcome contains the value of the result field from the original log.
- duration_seconds contains the original duration_ms divided by 1000 to convert it to seconds.
- meaning_of_life is a new field that contains the value 42.
source logs
| choose result as outcome, duration_ms / 1000 as duration_seconds, 42 as meaning_of_life
# Result - Notice the key names have been changed according to the "as X" parts
{ "outcome": "success", "duration_seconds": 2.54, "meaning_of_life": 42 }
{ "outcome": "failure", "duration_seconds": 0.233, "meaning_of_life": 42 }
Count/Countby examples
Count all the success logs:
source logs
| filter result == 'success'
| count
# Result - Total number of logs after filtering
Count logs, grouped by success/failure:
source logs
| countby result
# Result - Number of logs per result value
{ "result": "success", "_count": 847 }
{ "result": "failure", "_count": 22 }
Count logs, grouped by region and by success or failure, with the count stored in a new field named request_count:
source logs
| countby region,result into request_count
# Result - Notice that the count keyname is set to request_count because of "into request_count"
{ "region": "eu-west-1", "result": "success", "request_count": 287 }
{ "region": "eu-west-1", "result": "failure", "request_count": 2 }
{ "region": "eu-west-2", "result": "success", "request_count": 2000 }
{ "region": "eu-west-3", "result": "success", "request_count": 54 }
{ "region": "eu-west-3", "result": "failure", "request_count": 2 }
Count events in each region, and return the 3 regions with the highest counts:
source logs
| top 3 region by count()
# Result - 3 rows, each containing a region and a count of logs
Average the duration for each region, and return the 3 regions with the lowest averages:
source logs
| bottom 3 region by avg(duration_ms)
# Result - 3 rows, each containing a region and an average duration
Groupby examples
Get the average and maximum durations for successes or failures:
# Option 1 - Output keypaths are named automatically
source logs
| groupby result calc avg(duration_ms),max(duration_ms)
# Result 1
{ "result": "success", "_avg": 23.4, "_max": 287 }
{ "result": "failure", "_avg": 980.1, "_max": 1000.2 }
# Option 2 - Using "as X" to name the output keypaths
source logs
| groupby result calc avg(duration_ms) as avg_duration,max(duration_ms) as max_duration
# Result 2
{ "result": "success", "average_duration": 23.4, "max_duration": 287 }
{ "result": "failure", "average_duration": 980.1, "max_duration": 1000.2 }
When querying with the groupby operator, you can apply an aggregation function (such as avg, max, or sum) to the bucket of results. This feature lets you manipulate the aggregation expression inside the expression itself, so you can calculate and transform your data in a single step. DataPrime aggregation functions can be found in the DataPrime reference.
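For example, a hedged sketch that applies arithmetic to the aggregation expression itself, converting the average duration from milliseconds to seconds inside the calc clause:
source logs
| groupby result calc avg(duration_ms) / 1000 as avg_duration_seconds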
Distinct examples
Get distinct regions from the data, grouping the logs by region name without aggregations.
Consider the following input:
# Input Examples:
{ "region": "us-east-1", ... }
{ "region": "us-east-1", ... }
{ "region": "eu-west-1", ... }
# Query 1 - Get distinct regions from the data
source logs
| distinct region
# Results 1 - distinct region names
{ "region": "us-east-1" }
{ "region": "eu-west-1" }
Enrich examples
Enrich and filter your logs using additional context from a lookup table. For example, enrich user activity logs with the user’s department and then retrieve logs of all users in the Finance department.
First you need to upload the lookup table. In the IBM Cloud Logs UI, click the Data flow icon > Data Enrichment. Add a custom enrichment.
There are two possible ways to enrich your logs:
- Select log key to look up a key value and enrich the logs automatically during ingestion. The logs are saved with the enriched fields. The advantages of this mode are:
  - Logs are automatically enriched.
  - The logs include the enrichment data, which makes them easier to consume everywhere, including by any query and by third-party products that read the logs from the bucket.
- Use the DataPrime enrich query to look up a value in the table and enrich the log dynamically for the purpose of the query. The advantages of this mode are:
  - You can enrich old log data already ingested into IBM Cloud Logs.
  - The enrichment does not increase the size of the stored logs, since the enrichment is done dynamically and used only for the query results.
The syntax of the enrich operator is:
enrich <value_to_lookup> into <enriched_key> using <lookup_table>
The <value_to_lookup> is the name of a log key or the actual value that will be looked up in the custom enrichment <lookup_table>, and a key called <enriched_key> will be added to the log with all table columns as sub-keys. If the <value_to_lookup> is not found in the <lookup_table>, the <enriched_key> will still be added, but with "null" values to maintain the same structure for all result logs. You can then filter the results using DataPrime capabilities, such as filtering logs by a specific value in an enriched field.
For example, using the following original log:
{
"userid": "111",
...
}
And, using the Custom Enrichment lookup table called "my_users":
ID Name Department
111 John Finance
222 Emily IT
Running the following query:
enrich $d.userid into $d.user_enriched using my_users
Returns the following enriched log:
{
"userid": "111",
"user_enriched": {
"ID: "111",
"Name": "John",
"Department": "Finance"
},
...
}
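You can then filter on the enriched sub-keys like any other keypath. For example, a hedged sketch that keeps only the logs of users in the Finance department:
source logs
| enrich $d.userid into $d.user_enriched using my_users
| filter user_enriched.Department == 'Finance'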
Consider the following when using enrich:
- Run the DataPrime query source <lookup_table> to view the enrichment table.
- If the original log already contains the enriched key:
  - If the <value_to_lookup> exists in the <lookup_table>, the sub-keys will be updated with the new value.
  - If the <value_to_lookup> does not exist, the current value will be retained.
  - Any other sub-keys which are not columns in the <lookup_table> will retain their existing values.
- All values in the <lookup_table> are considered to be strings. This means that:
  - The <value_to_lookup> must be in a string format.
  - All values are enriched in a string format. You can then convert the values to your preferred format (for example, JSON, timestamp) using the appropriate functions.
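As a hedged sketch of such a conversion (assuming the ID column of my_users holds numeric values stored as strings), the enriched value can be cast with the convert operator:
source logs
| enrich $d.userid into $d.user_enriched using my_users
| convert user_enriched.ID:number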
Extract example
Extract lets you take semi-structured text and extract meaningful data out of it. There are multiple methods to extract this data:
- regexp – Extract using regular expression capture-groups
- jsonobject – Take a stringified JSON and extract it to a real object
- kv – Extract key=value pairs from a string
The extracted values can also be converted to their real data type as part of the extraction. This is done by adding the datatypes clause that contains the required conversions. This is the same syntax as the convert operator.
Extract information from a text field using a regular expression:
This example uses this input data:
{ "msg": "... Query_time: 2.32 Lock_time: 0.05487 ..." }
{ "msg": "... Query_time: 0.1222 Lock_time: 0.0002 ..." }
...
# Query 1
source logs
// Filter the relevant logs using lucene
| lucene '"Query_time:"'
// Extract duration and lock_time strings from the msg field
| extract msg into stats using regexp(
/# Query_time: (?<duration>.*?) Lock_time: (?<lock_time>.*?) /)
// Choose to leave only the stats object that the extraction has created
| choose stats
# Results 1 - Output contains strings
{ "stats": { "duration": "0.08273" , "lock_time": "0.00121" } }
{ "stats": { "duration": "0.12" , "lock_time": "0.001" } }
{ "stats": { "duration": "3.121" , "lock_time": "0.83322" } }
...
# Query 2 - Added datatypes clause, so the extracted values will be numbers instead of strings
source logs
| lucene '"Query_time:"'
| extract msg into stats using regexp(
/# Query_time: (?<duration>.*?) Lock_time: (?<lock_time>.*?) /)
datatypes duration:number,lock_time:number
| choose stats
# Results 2 - Output contains real numbers and not strings (see previous example)
{ "stats": { "duration": 0.08273 , "lock_time": 0.00121 } }
{ "stats": { "duration": 0.12 , "lock_time": 0.001 } }
{ "stats": { "duration": 3.121 , "lock_time": 0.83322 } }
...
# Query 3 - Use the extracted values in a later operator, in this case a filter
source logs
| lucene '"Query_time:"'
| extract msg into stats using regexp(
/# Query_time: (?<duration>.*?) Lock_time: (?<lock_time>.*?) /)
datatypes duration:number,lock_time:number
| choose stats
// Filter for only the logs which contain a lock_time which is greater than 0.5
| filter stats.lock_time > 0.5
# Results 3 - Output contains real numbers
{ "stats": { "duration": 3.121 , "lock_time": 0.83322 } }
...
Extract a JSON object stored in a string:
This example uses this input data:
{"my_json": "{\\"x\\": 100, \\"y\\": 200, \\"z\\": {\\"a\\": 300}}" , "some_value": 1}
{"my_json": "{\\"x\\": 400, \\"y\\": 500, \\"z\\": {\\"a\\": 600}}" , "some_value": 2}
...
# Query 1
source logs
| extract my_json into my_data using jsonobject()
# Results 1
{
"my_json": "...",
"my_data": {
"x": 100,
"y": 200,
"z": { "a": 300 }
},
"some_value": 1
}
{
"my_json": "...",
"my_data": {
"x": 400,
"y": 500,
"z": { "a": 600 }
},
"some_value": 2
}
# Query 2 - Additional filtering on the resulting object
source logs
| extract my_json into my_data using jsonobject()
| filter my_data.x == 100
# Results 2 - Only the object containing x=100 is returned
Extract key=value data from a string. Notice that the kv extraction honors quoted values.
This example uses this input data:
{ "log": "country=Japan city=\\"Tokyo\\"" , ... }
{ "log": "country=Israel city=\\"Tel Aviv\\"" , ... }
...
# Query 1
source logs
| extract log into my_data using kv()
# Results 1
{
"log": "...",
"my_data": {
"country": "Japan",
"city": "Tokyo"
},
...
}
{
"log": "...",
"my_data": {
"country": "Israel",
"city": "Tel Aviv"
},
...
}
This example uses this input data:
# Example Data for Query 2 - Key/Value delimiter is ":" and not "="
{ "log": "country:Japan city:\\"Tokyo\\"" , ... }
{ "log": "country:Israel city:\\"Tel Aviv\\"" , ... }
...
# Query 2
source logs
| extract log into my_data using kv(':')
# Results 2 - Same results as query 1
Orderby/Sortby examples
Order the success logs by descending duration, returning the 10 logs with the highest duration:
source logs
| filter result == 'success'
| orderby duration_ms desc
| limit 10
# Result - Full raw logs
Numerically order a string field which contains numbers:
This example uses this input data:
{ "error_code": "23" }
{ "error_code": "12" }
{ "error_code": "4" }
{ "error_code": "1" }
# Query 1
source logs
| orderby error_code:number
# Results 1 - Ordered by numeric value
{ "error_code": "1" }
{ "error_code": "4" }
{ "error_code": "12" }
{ "error_code": "23" }
# Query 2 - By using the convert operator
source logs
| convert error_code:number
| orderby error_code
# Results 2 - Same results
Order by multiple fields in alphabetical order.
Ordering is case-sensitive; A-Z will be ordered before a-z.
This example uses this input data:
{ "last_name": "smith" , "first_name": "jane" }
{ "last_name": "doe", "first_name": "john" }
...
# Query
source logs
| orderby last_name,first_name
The following is the same example but with case-insensitive ordering:
source logs
| orderby toLowerCase(last_name),toLowerCase(first_name)
Create example
Create a new keypath value.
This example uses this input data:
{ "country": "Japan", "city": "Tokyo" }
{ "country": "Israel", "city": "Jerusalem" }
...
# Query 1
source logs
| create default_temperature from 32.5
# Results 1 - Each log contains the new field, with the same value
{ "country": "Japan", "city": "Tokyo", "default_temperature": 32.5 }
{ "country": "Israel", "city": "Jerusalem", "default_temperature": 32.5 }
...
# Query 2 - Create a new field which contains the first three letters of the country, converted to uppercase
source logs
| create country_initials from toUpperCase(substr(country,1,3))
# Results 2
{ "country": "Japan", "city": "Tokyo", "country_initials": "JAP" }
{ "country": "Israel", "city": "Jerusalem", "country_initials": "ISR" }
This example uses this input data:
{ ... , "temp_in_fahrenheit": 87.2 }
{ ... , "temp_in_fahrenheit": 32 }
...
# Query 3 - Convert the temperature from Fahrenheit to Celsius
source logs
| create temp_in_celcius from (temp_in_fahrenheit - 32) * 5 / 9
# Results 3
{ ... , "temp_in_fahrenheit": 87.2, "temp_in_celcius": 30.66666 }
{ ... , "temp_in_fahrenheit": 32, "temp_in_celcius": 0.0 }
...
Create a new field containing <country>/<city> as a string, using string-interpolation syntax:
# Query 4 - Combine country and city using string interpolation
source logs
| create country_and_city from `{country}/{city}`
# Results 4
{ "country": "Japan", "city": "Tokyo", "country_and_city": "Japan/Tokyo" }
{ "country": "Israel", "city": "Jerusalem", "country_and_city": "Israel/Jerusalem" }
Move example
This operator can be used to move a source keypath to a target keypath, including entire subtrees. It can also be used to rename a keypath.
For nested source keypaths, only the actual key is moved. The target keypath is merged with any other objects or keys which already exist in the data.
When moving an entire subtree, the target keypath will serve as the root of the new subtree.
Move a key to a target location.
This example uses this input data:
{ "query_id": "AAAA", "stats": { "total_duration": 23.3 , "total_rows": 500 }}
...
# Query 1 - Rename a keypath
source logs
| move query_id to query_identifier
# Results 1 - Keypath renamed
{ "query_identifier": "AAAA", "stats": { "total_duration": 23.3 , "total_rows": 500 } }
...
# Query 2 - Move a key to an existing subtree
source logs
| move query_id to stats.query_id
# Results 2 - query_id moved into "stats"
{ "stats": { "total_duration": 23.3 , "total_rows": 500, "query_id": "AAAA" } }
...
Move a subtree to another location.
# Query 1 - Rename subtree
source logs
| move stats to execution_data
# Results 1 - Rename subtree
{ "query_identifier": "AAAA", "execution_data": { "total_duration": 23.3 , "total_rows": 500 } }
...
# Query 2 - Move subtree to root
source logs
| move stats to $d
# Results 2
{ "query_id": "AAAA", "total_duration": 23.3 , "total_rows": 500 }
...
This example uses this input data:
{ "request": { "id": "1000" } , "user": { "name": "james", "id": 50 } }
...
# Move subtree to another subtree
source logs
| move user to request.user_info
# Results - Entire user subtree moved into request.user_info
{ "request": { "id": "1000", "user_info": { "name": "james", "id": 50 } } }
...
Replace examples
Replace the value in an existing keypath:
This example uses this input data:
{ "user": { "id": "1000" , "name": "James", "email": "james@mycompany.com" } }
{ "user": { "id": "2000" , "name": "John", "email": "john@mycompany.com" } }
...
# Example 1
source logs
| replace user.name with 'anyone'
# Results 1
{ "user": { "id": "1000" , "name": "anyone", "email": "james@mycompany.com" } }
{ "user": { "id": "2000" , "name": "anyone", "email": "john@mycompany.com" } }
...
# Example 2
source logs
| replace user.name with user.email
# Results 2
{ "user": { "id": "1000" , "name": "james@mycompany.com", "email": "james@mycompany.com" } }
{ "user": { "id": "2000" , "name": "john@mycompany.com", "email": "john@mycompany.com" } }
# Example 3
source logs
| replace user.name with `UserName={user.id}`
# Results 3
{ "user": { "id": "1000" , "name": "UserName=1000", "email": "james@mycompany.com" } }
{ "user": { "id": "2000" , "name": "UserName=2000", "email": "john@mycompany.com" } }
...
Remove examples
This example uses this input data:
{
"stats": {
"duration_ms": 2.34,
"rows_scanned": 501,
"message": "Operation has taken 2.34 seconds"
},
"some_value": 1000
}
...
# Query 1 - Remove the message keypath
source logs
| remove stats.message
# Results 1
{
"stats": {
"duration_ms": 2.34,
"rows_scanned": 501
},
"some_value": 1000
}
...
# Query 2 - Remove the entire stats subtree
source logs
| remove stats
# Results 2
{
"some_value": 1000
}
...
Redact examples
This example uses this input data:
{ "serverIp": "ip-172-30-20-12.eu-west-1.compute.internal", ... }
{ "serverIp": "ip-172-82-121-1.eu-west-2.compute.internal", ... }
{ "serverIp": "ip-172-99-72-187.us-east-1.compute.internal", ... }
...
# Query 1 - Redact all parts containing the string 'compute.internal'
source logs
| redact serverIp matching 'compute.internal' to ''
# Results 1
{ "serverIp": "ip-172-30-20-12.eu-west-1", ... }
{ "serverIp": "ip-172-82-121-1.eu-west-2", ... }
{ "serverIp": "ip-172-99-72-187.us-east-1", ... }
# Query 2 - Redact all digits before aggregation using regexp
source logs
| redact serverIp matching /[0-9]+/ to 'X'
| countby serverIp
# Results 2
{ "serverIp": "ip-X-X-X-X.eu-west-X.compute.internal", "_count": 2323 }
{ "serverIp": "ip-X-X-X-X.us-east-X.compute.internal", "_count": 827 }
...
Source examples
Set the data source that your DataPrime query is based on.
The syntax is:
source <data_store>
Where <data_store> can be:
- logs
- spans
- The name of a custom enrichment. In this case, the command displays the custom enrichment table.
source logs
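Additional hedged examples, assuming that spans are being ingested and that the my_users Custom Enrichment table from the Enrich examples exists:
# Query spans instead of logs
source spans
# Display the contents of a Custom Enrichment table
source my_users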