Overview

IBM Cloud® Data Engine is deprecated. As of 18 February 2024, you can no longer create new instances, and access to free instances will be removed. Existing Standard plan instances are supported until 18 January 2025. Any instances that still exist on that date will be deleted.

Data Engine is a fully managed service that runs SQL queries (that is, SELECT statements) to read, analyze, transform, store, and stream data in IBM Cloud® Object Storage and Kafka. You can also manage table metadata in a catalog that is compatible with Hive metastore. Data Engine is IBM Cloud®'s central service for data lakes. Combining Data Engine with data in Object Storage enables you to create an active workspace for a range of big data analytics use cases.

Figure 1. Data Engine overview

Input data is read from CSV, JSON, ORC, Parquet, or AVRO objects that are located in one or more Cloud Object Storage instances. Each query result is written to a CSV, JSON, ORC, Parquet, or AVRO object in a Cloud Object Storage or Db2 instance of your choice. Use the Data Engine user interface (UI) to develop your queries and the Data Engine REST API to automate them.

Input and output of queries

Before you can use the Data Engine service to run SQL queries, the input data must be uploaded to one or more Cloud Object Storage instances. You must also have at least 'Writer' access to at least one Cloud Object Storage bucket, so that result objects (that is, the objects that contain output data) can be written there. For more information about Cloud Object Storage, including how to provision an instance, create buckets, and upload data, see the Cloud Object Storage Getting Started Guide and Reading and writing to Cloud Object Storage. You can also write to databases and take advantage of index management.

Programmatic access

REST API

You can use the Data Engine service REST API to run queries and retrieve information about their status. This is especially helpful when you write code that automatically queries data.

Note: The Cloud Resource Name (CRN) is a mandatory part of a Data Engine REST endpoint call. The CRN Copy button copies your CRN to the clipboard, and you can paste it into your API call.
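
For illustration, the following Python sketch submits a query and polls its status over the REST API. It assumes the api.sql-query.cloud.ibm.com endpoint, that the CRN is passed as an instance_crn query parameter, and that the request body uses statement and resultset_target fields; verify these details, along with the returned job_id and status fields, against the Data Engine API reference. The bucket names are placeholders.

    # Sketch only: endpoint host, query parameter, payload, and response fields are
    # assumptions to be checked against the Data Engine API reference.
    import time
    import requests

    API_ENDPOINT = "https://api.sql-query.cloud.ibm.com/v2"  # assumed endpoint
    INSTANCE_CRN = "<your Data Engine instance CRN>"          # copied with the CRN Copy button
    IAM_TOKEN = "<your IAM bearer token>"

    headers = {"Authorization": f"Bearer {IAM_TOKEN}"}
    params = {"instance_crn": INSTANCE_CRN}

    # Submit an SQL query (requires the Manager or Writer role).
    response = requests.post(
        f"{API_ENDPOINT}/sql_jobs",
        headers=headers,
        params=params,
        json={
            "statement": "SELECT * FROM cos://us-geo/my-bucket/my-data/ STORED AS PARQUET LIMIT 10",
            "resultset_target": "cos://us-geo/my-result-bucket/results/",  # bucket with Writer access
        },
    )
    response.raise_for_status()
    job_id = response.json()["job_id"]

    # Poll the job until it finishes (requires the Manager, Writer, or Reader role).
    while True:
        job = requests.get(
            f"{API_ENDPOINT}/sql_jobs/{job_id}", headers=headers, params=params
        ).json()
        if job["status"] in ("completed", "failed"):
            break
        time.sleep(5)
    print(job["status"])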

Python applications and Notebooks

For a Python application, you can also use the ibmcloudsql package. Use IBM Watson Studio to run queries with Data Engine and visualize the query results with one of the various widget libraries available in Watson Studio.
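
A minimal sketch of that approach follows; the API key, CRN, bucket names, and SQL statement are placeholders, and the SQLQuery and run_sql calls should be checked against the current ibmcloudsql documentation.

    # Sketch only: credential values, bucket names, and column names are placeholders.
    import ibmcloudsql

    sql_client = ibmcloudsql.SQLQuery(
        api_key="<your IBM Cloud API key>",
        instance_crn="<your Data Engine instance CRN>",
        target_cos_url="cos://us-geo/my-result-bucket/results/",  # bucket where you have Writer access
    )

    # Run a query against Parquet objects in Cloud Object Storage and get the result
    # back as a pandas DataFrame, for example to visualize it in a Watson Studio notebook.
    df = sql_client.run_sql(
        "SELECT customer_id, SUM(amount) AS total "
        "FROM cos://us-geo/my-data-bucket/orders/ STORED AS PARQUET "
        "GROUP BY customer_id"
    )
    print(df.head())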

Cloud functions

Data Engine is a serverless mechanism for submitting SQL queries, which makes it a natural match for IBM Cloud Functions. You can use the generic SQL Cloud function to run Data Engine as an IBM Cloud function.

Geospatial functions

The Geospatial Toolkit provides a set of geospatial functions that you can use to efficiently process and index spatial data. These functions are integrated into the Data Engine service and ready for immediate use. The Data Engine service also provides several sample queries that illustrate how to use these functions.
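
As an illustration, the following sketch filters rows by distance from a point. The bucket, object, and column names are hypothetical, and the ST_Point and ST_Distance calls (including the assumption that distances are returned in meters) should be checked against the Geospatial Toolkit reference and the provided sample queries.

    # Illustrative sketch: bucket, object, and column names are hypothetical, and the
    # geospatial function signatures should be verified in the Geospatial Toolkit reference.
    import ibmcloudsql

    sql_client = ibmcloudsql.SQLQuery(
        api_key="<your IBM Cloud API key>",
        instance_crn="<your Data Engine instance CRN>",
        target_cos_url="cos://us-geo/my-result-bucket/results/",
    )

    # Find points of interest within roughly 1 km (assuming meters) of a given location.
    df = sql_client.run_sql("""
        SELECT name, lon, lat
        FROM cos://us-geo/my-data-bucket/poi/ STORED AS PARQUET
        WHERE ST_Distance(ST_Point(lon, lat), ST_Point(-87.6278, 41.8819)) < 1000.0
    """)
    print(df)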

Required user roles

The following table shows which user roles are required to invoke a particular service action or API endpoint. Use this information to decide which access rights to grant your users when you create new user IDs.

Table 5. User roles

| Description | Service action | API endpoint | Required user roles |
|-------------|----------------|--------------|---------------------|
| Submit an SQL query. | sql-query.api.submit | POST /v2/sql_jobs/ | Manager or Writer |
| Get information for all submitted jobs. | sql-query.api.getalljobs | GET /v2/sql_jobs/ | Manager, Writer, or Reader |
| Get information for a specific submitted job. | sql-query.api.getjobinfo | GET /v2/sql_jobs/{job_id} | Manager, Writer, or Reader |
| Submit a catalog or index management statement. | sql-query.api.managecatalog | POST /v2/sql_jobs/ | Manager |

Behavior of scanned data

Data Engine reads as little data as possible based on your query. The amount of data that is scanned depends on the amount of data that Data Engine must read to run your query, not on the total size of your data. Several factors determine how much data must be accessed to run a query. First, the data format matters: columnar formats, such as Parquet, reduce the amount of data that is scanned because Data Engine can selectively read single columns and ranges. Second, the object layout determines how many objects need to be scanned. Read How to lay out big data in IBM Cloud Object Storage for Spark SQL for details on how to lay out data on Cloud Object Storage to improve the cost and performance of your SQL queries. Each successful query is charged for at least 10 MB.

Example

Assume you have 1 PB of data stored on Cloud Object Storage, laid out as described in the blog post and optimized for the queries you want to run. The most expensive possible query is SELECT * FROM ..., because it requires reading the full 1 PB of data. Any other query is cheaper and faster. For example, assume the 1 PB data set consists of audit events for users of a system (user A performed action B in system X at time T) and the data is partitioned by time (one object per day and system). To answer a query like SELECT DISTINCT user FROM ... WHERE System='X' AND Day >= (TODAY - 30), Data Engine must access all objects for system X that contain data for the last 30 days. The sum of the sizes of these objects is the upper bound of scanned data that you would be charged for. However, because Data Engine accesses only one field and the data is stored as Parquet, the amount scanned is much smaller. The precise cost of a query cannot be calculated in advance because much of it depends on the data itself. Parquet, for example, stores compressed columns, so if a column compresses well, even less data needs to be read. For further details about Data Engine ETL capabilities and how they affect scanned data, see the blog post Data Engine releases serverless transformation and partitioning of data in open formats.
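
For illustration, the audit-event query from this example could look like the following sketch. The bucket, the partitioning, and the column names (system, day, user) are hypothetical; with such a layout, Data Engine reads only the objects for system X and the last 30 days, and with Parquet only the user column within them.

    # Hypothetical sketch of the audit-event query described above; bucket layout and
    # column names are assumptions. With data partitioned by system and day, only the
    # matching objects are read, and with Parquet only the 'user' column is scanned.
    import ibmcloudsql

    sql_client = ibmcloudsql.SQLQuery(
        api_key="<your IBM Cloud API key>",
        instance_crn="<your Data Engine instance CRN>",
        target_cos_url="cos://us-geo/my-result-bucket/results/",
    )

    df = sql_client.run_sql("""
        SELECT DISTINCT user
        FROM cos://us-geo/my-audit-bucket/events/ STORED AS PARQUET
        WHERE system = 'X'
          AND day >= date_sub(current_date(), 30)
    """)
    print(df)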

Timestamps

Values of the timestamp data type are created in the Coordinated Universal Time (UTC) zone by default. For instance, the expressions timestamp('2009-07-30 04:17:52'), to_timestamp('2016-12-31', 'yyyy-MM-dd'), and current_timestamp all result in Coordinated Universal Time timestamp values, and the input string expressions are assumed to be in Coordinated Universal Time.

If you want to create a Coordinated Universal Time timestamp from a string expression that represents a different time zone, use to_utc_timestamp, as in to_utc_timestamp('2016-08-31', 'Asia/Seoul').

You can also create timestamp values in a different time zone from a Coordinated Universal Time timestamp value or string expression by using from_utc_timestamp, as in from_utc_timestamp(current_timestamp, 'Asia/Seoul') or from_utc_timestamp('2016-08-31', 'Asia/Seoul').
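
The following sketch applies these conversions to a hypothetical event_time column that is stored in Coordinated Universal Time; the ibmcloudsql setup values and the bucket and column names are placeholders.

    # Sketch only: bucket and column names are placeholders; event_time is assumed to be
    # stored in UTC. Asia/Seoul is UTC+9, so the expected values in the comments follow.
    import ibmcloudsql

    sql_client = ibmcloudsql.SQLQuery(
        api_key="<your IBM Cloud API key>",
        instance_crn="<your Data Engine instance CRN>",
        target_cos_url="cos://us-geo/my-result-bucket/results/",
    )

    df = sql_client.run_sql("""
        SELECT
          event_time,                                      -- UTC value as stored
          from_utc_timestamp(event_time, 'Asia/Seoul')     AS seoul_time,   -- rendered as Seoul local time
          to_utc_timestamp('2016-08-31', 'Asia/Seoul')     AS back_to_utc   -- 2016-08-30 15:00:00
        FROM cos://us-geo/my-data-bucket/events/ STORED AS PARQUET
    """)
    print(df)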

Limitations

  • If a JSON, ORC, or Parquet object contains a nested or arrayed structure, a query with CSV output that uses a wildcard (for example, SELECT * from cos://...) returns an error such as "Invalid CSV data type used: struct<nested JSON object>." Use one of the following workarounds:

    • For a nested structure, use the FLATTEN table transformation function. Alternatively, you can specify the fully nested column names instead of the wildcard, for example, SELECT address.city, address.street, ... from cos://....
    • For an array, use the Spark SQL explode() function, for example, select explode(contact_names) from cos://....
  • If you receive a corrupted result, verify that the source URI is correct and that the correct input format is specified by using 'STORED AS' in the SQL statement.

  • If you receive an error message that states that some columns are not found in the input columns, even though the columns do exist in the input, check whether the input format that is specified with 'STORED AS' in the SQL statement matches the actual format of your input.

  • To process CSV input with Data Engine, each row must be contained within one line. Multi-line values are not supported.

    If you use Data Engine to generate CSV results from other data formats, like Parquet, that support newlines within values, and these CSV results are queried again, the newlines must be removed explicitly before you write the results. To do so, use the SQL function regexp_replace. For example, assume a Parquet object named data has an attribute multi_line that contains values spanning multiple lines. To select a subset of rows based on a condition and store it on Cloud Object Storage for further processing, a skeleton SQL statement looks like the following example:

    SELECT regexp_replace(multi_line, '[\\r\\n]', ' ') as multi_line FROM data STORED AS parquet WHERE condition