Querying data directly from the archive
You can query your IBM Cloud Logs archive by using any third-party framework that provides a standard Apache Parquet reader, together with the required schema described in this section.
Archive folder structure
IBM Cloud Logs archive data is stored in standard Hive-like partitions with the following partition fields:
- `team_id=<team-id>`: IBM Cloud Logs team ID
- `dt=YYYY-MM-DD`: Date of the data in UTC
- `hr=HH`: Hour of the data in UTC

These fields can be defined as virtual columns inside the framework and can be used as filters in a query.
Be aware of the following:
- Both `dt` and `hr` are based on the event timestamp.
- The `team_id=<team-id>` partition lets you reuse the same bucket and prefix to write data from multiple IBM Cloud Logs teams and query them all in a single query.
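As an illustration, the following is a minimal sketch using Python with pyarrow, which can discover Hive-style partitions and expose them as virtual columns. The bucket path `s3://my-archive-bucket/logs` and the team ID `1234` are hypothetical placeholders, and the sketch assumes your Cloud Object Storage bucket is reachable through an S3-compatible endpoint with credentials already configured.

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Declare the Hive-style partition fields as virtual columns.
# Declaring them explicitly as strings avoids surprises from
# automatic type inference on values such as "08".
partitioning = ds.partitioning(
    pa.schema([
        ("team_id", pa.string()),
        ("dt", pa.string()),
        ("hr", pa.string()),
    ]),
    flavor="hive",
)

# Hypothetical bucket and prefix; replace with your own.
dataset = ds.dataset(
    "s3://my-archive-bucket/logs",
    format="parquet",
    partitioning=partitioning,
)

# The partition fields act as filters, so only one hour of one
# team's data is actually read from the archive.
table = dataset.to_table(
    filter=(
        (ds.field("team_id") == "1234")
        & (ds.field("dt") == "2022-03-28")
        & (ds.field("hr") == "08")
    )
)
print(table.num_rows)
```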
Fields
Each Apache Parquet file has three fields with data as JSON-formatted strings:
- `src_obj__event_metadata`: A JSON object containing metadata related to the event.
- `src_obj__event_labels`: A JSON object containing the labels of the event (such as the IBM Cloud Logs `applicationName` and `subsystemName`).
- `src_obj__user_data`: A JSON object containing the actual event data.
The following is an example of `src_obj__event_metadata`:
```json
{
  "timestamp": "2022-03-28T08:50:57.946",
  "severity": "Debug",
  "priorityclass": "low",
  "logid": "some-uuid"
}
```
The following is an example of `src_obj__event_labels`:
```json
{
  "applicationname": "some-app",
  "subsystemname": "some-subsystem",
  "category": "some-category",
  "classname": "some-class",
  "methodname": "some-method",
  "computername": "some-computer",
  "threadid": "some-thread-id",
  "ipaddress": "some-ip-address"
}
```
The following is an example of `src_obj__user_data`:
```json
{
  "_container_id": "0f099482cf3b507462020e9052516554b65865fb761af8e076735312772352bf",
  "host": "ip-10-1-11-144",
  "short_message": "10.1.11.144 - - [28/Mar/2022:08:50:57 +0000] \"GET /check HTTP/1.1\" 200 16559 \"-\" \"Consul Health Check\" \"-\""
}
```
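Because the three fields are JSON-formatted strings rather than nested columns, most frameworks need an explicit decoding step. The following minimal Python sketch continues from the hypothetical `table` read in the earlier example, decoding each field with the standard `json` module; the field names and values come from the examples above.

```python
import json

# Decode the three JSON-string fields row by row.
for row in table.to_pylist():
    metadata = json.loads(row["src_obj__event_metadata"])
    labels = json.loads(row["src_obj__event_labels"])
    user_data = json.loads(row["src_obj__user_data"])

    # Example: print Debug events emitted by one application.
    if (
        metadata.get("severity") == "Debug"
        and labels.get("applicationname") == "some-app"
    ):
        print(metadata["timestamp"], user_data.get("short_message"))
```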
Archive queries and parsing
With the IBM Cloud Logs Archive Query feature, you can query logs directly from your archive by using Lucene, DataPrime, or regular expression query syntax, without the queries counting against your daily quota, even if the data was never indexed. You can store more of your data in the Analyze and alert and Store and search pipelines and take advantage of the IBM Cloud Logs real-time analysis and remote storage search capabilities. This means that you can use a shorter retention period and still quickly query all of your data.
Archive queries run on the archive that you set in IBM Cloud Logs and are available for all TCO logging levels. For example, if you prioritize logs into the Analyze and alert pipeline, you can still query them without indexing the data. You can also view and query them in LiveTail, receive real-time alerts and notifications of anomalies, and use parsing rules, log aggregation, and events to metrics, all at a lower cost than data sent to the Priority insights pipeline.