Using regex when querying data

Regular expressions (regex) can be used when querying IBM Cloud Logs data for pattern searching and string replacement.

You might want to extract specific data from your logs to make it easier to analyze and visualize. Sometimes you might want to capture specific logged data. Other times, you might need to hide sensitive data in logs before they are saved.

You can also match using a regex pattern rather than an exact text search.

What is regex and how does it work?

Regular Expressions, also known as regex, is a domain-specific language (DSL) used for pattern searches and replacements.

The information in this topic is not intended to give a full educational tutorial on regex. If you are unfamiliar with regex, you might want to review publicly available information on regex before trying to understand the specific concepts included here.

Regex concepts

There are some basic concepts you need to know to understand the examples in this topic.

Capture group: A part of a regex enclosed in parentheses. Operators that follow the group apply to all of the text matched by the pattern inside the parentheses.
Named capture group: A capture group that is associated with a name. The matched results can be referenced by that name.
Character class: A set of characters to be matched, enclosed in square brackets ([]). A dash can be used as a shorthand to list several characters: [1-5] is the same as [12345].
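
The three concepts above can be tried out locally. The following is a minimal sketch using Python's re module, which is only a stand-in for illustration; the sample string and the group name id are made up, and the regex engine used by IBM Cloud Logs might differ in details.

```python
import re

sample = "user_32 logged in"

# Capture group: the parentheses capture the digits that follow "user_".
user_id = re.search(r"user_([0-9]+)", sample).group(1)

# Named capture group: the same capture, referenced by the name "id".
named_id = re.search(r"user_(?P<id>[0-9]+)", sample).group("id")

# Character class: [1-5] matches exactly one of the characters 1, 2, 3, 4, 5.
in_class = re.findall(r"[1-5]", "0123456789")

print(user_id, named_id, in_class)
```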

When to use regex

There are times when you will not need a regular expression and you can just search for a specific text. For example, if you just want to find log lines with the text user logged in, you can simply enter this text into your log search. But if your log lines look like this: user_32 logged in you can’t search for the exact text since the user ID is a variable that changes.

Fortunately, there is a regex pattern for this case. It uses the \d+ sequence, which matches one or more digits:

user_\d+ logged in
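
To see which lines such a pattern selects, here is a small sketch in Python. The sample lines are invented for illustration, and Python's re module is assumed to behave like the log search only for this simple pattern.

```python
import re

# Hypothetical log lines; only the first two contain a numeric user ID.
lines = [
    "user_32 logged in",
    "user_7 logged in",
    "user logged in",
    "user_abc logged in",
]

pattern = re.compile(r"user_\d+ logged in")

# search() finds the pattern anywhere in the line.
matching = [line for line in lines if pattern.search(line)]
print(matching)
```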

Using regex to extract text into custom JSON fields

Suppose you have an unstructured log in the following format:

${logLevel}: World-${worldName}: ${logText}

You would like to convert all entries in this format into a JSON object in the following format:

{
    "level": `${logLevel}`,
    "tag": `World-${worldName}`,
    "text": `${logText}`
}

You can use this regex to do the conversion:

^(?P<level>.*?):\s*(?P<tag>.*?):

Where:

^ is the start-of-line symbol. The match must begin at the start of a line.
(?P<name>X) is the named capturing group syntax, where name is the group name and X is the pattern to match. This regex has two named capturing groups, level and tag. The third JSON key, text, is set automatically.
\s means “any whitespace character”.
. means “any one character”.
* means “0 or more matches of the previous symbol or character”.
.* means “any characters, any number of times”.
.*? means “any characters, any number of times, but as few as possible to make the match”.
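
The difference between .* and .*? is easiest to see side by side. This is a short sketch in Python, assuming the usual greedy and lazy quantifier semantics; the sample line is taken from the example later in this topic.

```python
import re

line = "info: World-w-8: generate: new world"

# Greedy: .* takes as much as possible, so it stops at the LAST colon.
greedy = re.match(r"^(.*):", line).group(1)

# Lazy: .*? takes as little as possible, so it stops at the FIRST colon.
lazy = re.match(r"^(.*?):", line).group(1)

print(repr(greedy))
print(repr(lazy))
```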

So, the regex:

^(?P<level>.*?):\s*(?P<tag>.*?):

Will be processed as follows:

Text from the start of a line until the first : symbol is captured as the group level.
After any amount of whitespace, any text until the next : symbol is captured as the group tag.
The full log text is automatically set as the text field by IBM Cloud Logs, so it does not need its own capturing group.

Using the same regex, the following log:

"info: World-w-8: generate: new world"

Is converted into this JSON object:

{
   "level": "info",
   "tag": "World-w-8",
   "text": "info: World-w-8: generate: new world"
}
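
The conversion can be reproduced outside the product to check a pattern before using it. The following is a sketch in Python, whose re module happens to support the same (?P<name>...) syntax. The parse function is a hypothetical helper, and adding the text key by hand only imitates what IBM Cloud Logs is described as doing automatically.

```python
import json
import re

PATTERN = re.compile(r"^(?P<level>.*?):\s*(?P<tag>.*?):")

def parse(line: str) -> dict:
    """Return the named groups as a dict, plus the full line as 'text'."""
    match = PATTERN.search(line)
    fields = match.groupdict() if match else {}
    fields["text"] = line  # imitates the automatic text field
    return fields

record = parse("info: World-w-8: generate: new world")
print(json.dumps(record, indent=3))
```

A line that does not match the pattern yields only the text key in this sketch.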

Extracting text into predefined fields

You might want to extract text into predefined fields. Consider this log:

"info: World-w-8: generate: new world"

You might want to extract the info text into the Severity column, and the World text into the Class column.

You can modify the regex to set correct names of the capturing groups as follows:

^(?P.*?):\s*(?P.*?)-(?P.*?):

This regex will format the log so it will display as shown.

Figure: Log display after applying regex formatting

Extracting specific data from structured logs

A similar method can be used to extract data from structured logs. The following is an example showing you how to extract data from a specific JSON field.

Suppose you have a structured log line similar to this:

{
    "type": `${text}`,
    "log": `${text}`,
    "region": "rg-europe-2"
}

Now, if the region field has the following form:

`rg-${"europe"|"asia"|"na"}-${number}`

We want to extract the part that tells us whether the region is europe, asia, or na. The regex to extract the data is:

"region"\s*:\s*"rg-(?P<regionName>.*?)-

The named capturing group regionName extracts the text. According to our format, the region name follows the key name region and the characters rg-. The \s* symbols keep the regex working if there is any whitespace before or after the : symbol.

The result will be similar to the following:

{
    "log" : "Bye",
    "regionName" : "na",
    "region" : "rg-na-1",
    "type" : "ltest-w-9"
}
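
As a rough local check, the extraction can be simulated in Python. The raw string below is a made-up single-line serialization of the example log, and merging the captured group into the parsed JSON is an assumption about how the result is assembled, not a description of the product internals.

```python
import json
import re

REGION = re.compile(r'"region"\s*:\s*"rg-(?P<regionName>.*?)-')

# Hypothetical raw log; note the space before the colon after "region".
raw = '{"log": "Bye", "region" : "rg-na-1", "type": "ltest-w-9"}'

record = json.loads(raw)
match = REGION.search(raw)
if match:
    # Add the captured value as a new key next to the original fields.
    record["regionName"] = match.group("regionName")

print(record)
```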

Replacing and removing values

One of the most common examples where we need to replace or remove values is hiding personal data. Suppose you log phone numbers somewhere and don’t want those to be saved in IBM Cloud Logs.

Suppose we have an unstructured log line like this:

"info: Sender: sendSms: sending sms to phone number +12345678910 to user Andrew"

You want to remove the phone number and name from this line. This regex matches the line, starting with "sending sms":

sending sms to phone number \+*\d+ to user .*

We need to escape the + symbol with a backslash (\+) because + has a special meaning in regex syntax: “1 or more of the previous character”.

The symbol \d matches any single digit. Remember that the * symbol means “0 or more of the previous character”. So \+*\d+ matches one or more digits that can optionally be preceded by a + symbol.

The following replacement string replaces the text matched by the previous regex with the same text, but without the phone number and name:

sending sms to phone number * to user *

And here is the result of applying the above rule:

"info: Sender: sendSms: sending sms to phone number * to user *"

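The same match-and-replace pair can be tested locally before it is applied to real logs. This is a sketch in Python using re.sub; it assumes the replacement string is treated as literal text, which holds here because it contains no backslashes or group references.

```python
import re

MATCH = re.compile(r"sending sms to phone number \+*\d+ to user .*")
REPLACEMENT = "sending sms to phone number * to user *"

line = "info: Sender: sendSms: sending sms to phone number +12345678910 to user Andrew"

# Everything from "sending sms" to the end of the line is swapped out.
redacted = MATCH.sub(REPLACEMENT, line)
print(redacted)
```

Because the pattern ends with .*, anything that follows the user name on the same line is removed as well.
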
Replacing JSON values in structured logs

Replacing JSON values in structured logs is similar to replacing values in unstructured logs. The whole JSON string is used as the input for the Replace rule.

Using the following JSON structure:

{
   "type": `${text}`,
   "log": `${text}`,
   "region": "rg-europe-2"
}

Suppose you need to replace the value of the type field with another value. Here is how you can match the value of the type field:

"type"\s*:\s*".*?"

Remember that we need the \s* symbols to make sure the regex works if there is whitespace before or after the : character.

Here is the replacement string that sets the value of the type field to newType:

"type":"newType"

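A quick way to confirm that the result is still valid JSON is to run the pair on a sample string. The following Python sketch assumes the whole JSON string is the input, as described above; the sample log is invented.

```python
import json
import re

MATCH = re.compile(r'"type"\s*:\s*".*?"')
REPLACEMENT = '"type":"newType"'

# Hypothetical log with whitespace around the colon after "type".
raw = '{"type" : "ltest-w-9", "log": "Bye", "region": "rg-na-1"}'

replaced = MATCH.sub(REPLACEMENT, raw)
print(replaced)

# The replaced string still parses, and only the type value changed.
parsed = json.loads(replaced)
```
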
Using backreferences

Suppose we need to replace europe with eu in the strings in the format west-europe-2. And suppose we need to do this not only in the region field but in any other part of the log where the string is found. Matching this pattern is easily done with this regex:

.+?-europe-\d+

However, replacing the string using the previous methods is harder, because the text before and after europe varies and must be preserved in the result. To do this, first capture those strings into capturing groups:

(.+?)-europe-(\d+)

Remember that the \d symbol means “any digit” and together with + it means “any digit one or more times”. In a similar way, .+? means any character one or more times, but as few as possible to make the match.

This regex captures the text before europe as capturing group 1 and the digits after europe as capturing group 2. The following replacement string uses backreferences to insert the matched content of those groups:

$1-eu-$2

For example, a log before applying our rule:

{
    "log" : "Here region is east-europe-1. That's it",
    "type" : "newType",
    "region" : "east-europe-1"
}

And the log after applying the rule:

{
    "log" : "Here region is east-eu-1. That's it",
    "type" : "newType",
    "region" : "east-eu-1"
}
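
The backreference rule can be checked with a short Python sketch. One difference to keep in mind: Python's re.sub writes backreferences as \g<1> and \g<2> rather than $1 and $2, so the replacement string below is the Python equivalent, not the syntax used in IBM Cloud Logs.

```python
import re

MATCH = re.compile(r"(.+?)-europe-(\d+)")
# Python equivalent of the $1-eu-$2 replacement string.
REPLACEMENT = r"\g<1>-eu-\g<2>"

log_text = "Here region is east-europe-1. That's it"
region = "east-europe-1"

new_log_text = MATCH.sub(REPLACEMENT, log_text)
new_region = MATCH.sub(REPLACEMENT, region)

print(new_log_text)
print(new_region)
```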

Searching using regex from the logs search bar

Another place where you can use regex is the IBM Cloud Logs UI Logs page search bar.

When searching from the page search bar, you can enter exact text, Lucene, or DataPrime queries. You can also look for particular patterns using regex. Regex queries have their own format, which is:

/${fieldName}.keyword:/REGEX//

Suppose we have many JSON-structured logs in the following format:

{    
    "log":  `${text}` ,
    "regionName":  `${text}`,
    "region":  `${text}`, 
    "type"  :  `ltest-w-${number}`
}

And we want to match only those entries where type is equal to ltest-w-1, ltest-w-2 or ltest-w-3. The following search query will do this:

/type.keyword:/ltest-w-[1-3]//

The text between the square brackets [ and ] is called a character class. It matches any one character listed between the square brackets. A dash can be used as a shorthand to list several characters: [1-5] is the same as [12345].

You can use the same syntax to match data in unstructured logs with regex. You only need to set the field name to text.keyword. For example:

/text.keyword:/.*ltest-w-[1-3].*//

This regex searches for the text ltest-w-1, ltest-w-2, or ltest-w-3 in the main log body.
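
The leading and trailing .* in the unstructured example suggest that the pattern has to cover the whole field value. The sketch below models that with Python's re.fullmatch. This whole-value behavior is an assumption made for illustration (it is how Lucene-style regexp queries generally work) and is not confirmed here for IBM Cloud Logs; matches_field is a hypothetical helper.

```python
import re

def matches_field(pattern: str, value: str) -> bool:
    """Assumed search semantics: the pattern must match the entire value."""
    return re.fullmatch(pattern, value) is not None

types = ["ltest-w-1", "ltest-w-3", "ltest-w-4", "ltest-w-12"]
selected = [t for t in types if matches_field(r"ltest-w-[1-3]", t)]
print(selected)

text = "request for ltest-w-2 finished"
# Without the surrounding .* the pattern cannot cover the whole log body.
bare = matches_field(r"ltest-w-[1-3]", text)
wrapped = matches_field(r".*ltest-w-[1-3].*", text)
print(bare, wrapped)
```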

Triggering alerts with regex

Another popular use case for regex in IBM Cloud Logs is defining alerts. The alert syntax is the same as the logs query syntax.

Suppose we want to be alerted on a log line of the following form:

`App: init: World-${name}: generation error: ${err}`

And suppose we want to get alerts only for worlds w-1, w-2, w-3 or w-4. Our alert regex will look like this:

/text.keyword:/.*World-w-[1-4]: generation error.*//

Remember that [1-4] matches any single character from 1 to 4, and .* matches any characters any number of times.
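
Before saving an alert, the regex part of the query can be tried against a few sample lines. The following is a sketch in Python with invented log lines; it tests only the regex between the slashes and assumes whole-line matching, not the actual alert evaluation in IBM Cloud Logs.

```python
import re

ALERT = re.compile(r".*World-w-[1-4]: generation error.*")

# Hypothetical log lines; only the first should trigger the alert.
lines = [
    "App: init: World-w-2: generation error: out of memory",
    "App: init: World-w-5: generation error: out of memory",
    "App: init: World-w-12: generation error: timeout",
    "App: init: World-w-3: generated successfully",
]

triggered = [line for line in lines if ALERT.fullmatch(line)]
print(triggered)
```

World-w-12 does not match because the character class consumes exactly one character, which must be followed directly by the colon.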

For more information about alerts, see: