Apache Gluten accelerated Spark engine
Apache Gluten accelerated Spark engine is an optimized, high-performance engine in watsonx.data. The Spark engine uses Apache Gluten to offload SQL execution to Velox, an open source execution engine implemented in C++, which accelerates Spark SQL computation and reduces the cost of running workloads.
Apache Gluten is a native engine plugin that accelerates Spark SQL and DataFrame workloads by integrating native engines with Spark SQL, combining the high scalability of the Spark SQL framework with the superior performance of native engines. It is fully compatible with the Apache Spark API, so no code changes are necessary.
Features of Apache Gluten accelerated Spark engine
- Supports the Apache Parquet and Apache Avro file formats.
- Improved table scan performance.
- Accelerates larger SQL queries with joins and aggregations.
- Supports Delta, Hudi, Iceberg, and Hive catalogs.
- Faster Delta and Parquet writes using UPDATE, DELETE, MERGE INTO, INSERT, and CREATE TABLE AS SELECT, including wide tables that contain thousands of columns.
- Replaces sort-merge joins with hash joins by default.
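To make the feature list concrete, the sketch below shows the session settings that the open source Apache Gluten documentation uses to enable the plugin on a self-managed Spark build. In watsonx.data these are typically preconfigured for you; the property names and the `4g` size are assumptions taken from the upstream Gluten docs, not watsonx.data-specific values.

```python
# Minimal sketch, assuming the upstream open source Gluten settings.
# watsonx.data typically presets these; values here are illustrative.
gluten_conf = {
    # Register the Gluten plugin so Spark SQL plans are offloaded to Velox.
    "spark.plugins": "org.apache.gluten.GlutenPlugin",
    # Velox operates on off-heap columnar memory, so off-heap must be enabled.
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "4g",
    # Columnar shuffle keeps data in Arrow format across the shuffle boundary.
    "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager",
}

# With PySpark installed, the dict can be applied when building the session:
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("gluten-demo")
# for key, value in gluten_conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```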
For more information about using the Spark engine, see Working with watsonx.data Spark.
Supported Spark version for Apache Gluten accelerated Spark engine
IBM® watsonx.data supports the following Spark runtime versions to run Spark workloads by using Apache Gluten accelerated Spark engine.
| Name | Status | Release date | End-of-support date |
|---|---|---|---|
| Apache Spark 3.4.4 | Deprecated | JAN 2025 | JUNE 2026 |
| Apache Spark 3.5.4 | Supported | JUNE 2025 | FEB 2028 |
| Apache Spark 4.0 | Supported | APR 2026 | AUG 2028 |
Limitations
- Smaller queries are not accelerated.
- Catalog association is supported only for S3 object stores.
- Fallbacks to JVM:
  - ANSI: Apache Gluten does not currently support ANSI mode. If ANSI mode is enabled, Spark plan execution always falls back to vanilla Spark.
  - FileSource format: Apache Gluten currently supports only the Parquet file format fully and supports ORC partially. If another file format is used, the scan operator falls back to vanilla Spark.
- Incompatible behavior:
  - JSON functions: Velox supports only double-quoted strings in JSON data, not single-quoted strings. If single quotes are used, Apache Gluten returns incorrect results.
  - Parquet read configuration: Apache Gluten supports `spark.files.ignoreCorruptFiles` with the default value `false`; if set to `true`, the behavior is the same as with `false`. Apache Gluten ignores `spark.sql.parquet.datetimeRebaseModeInRead` and returns only what is written in the Parquet file, without accounting for the difference between the legacy hybrid (Julian + Gregorian) calendar and the Proleptic Gregorian calendar. The result may therefore differ from vanilla Spark.
  - Parquet write configuration: Spark uses the `spark.sql.parquet.datetimeRebaseModeInWrite` config to decide whether the legacy hybrid (Julian + Gregorian) calendar or the Proleptic Gregorian calendar should be used when writing dates and timestamps to Parquet. If the Parquet file being read was written by Spark with this config set, Velox's TableScan outputs a different result when reading it back.
- Spill: `OutOfMemoryException` may still be triggered by the current implementation of the spill-to-disk feature when the number of shuffle partitions is set to a large value. If this happens, reduce the partition count to avoid the OOM.
- CSV read:
  - The `header` option must be `true`. Only DataSource V1 is currently supported, so set `spark.sql.sources.useV1SourceList=csv`. User-defined read options are not supported and cause CSV reads to fall back to vanilla Spark in most cases. CSV reads also fall back to vanilla Spark, with a warning logged, when the user-specified schema differs from the file schema.
For more detailed information about Apache Gluten limitations, see [Limitations](https://github.com/apache/incubator-gluten/blob/main/docs/velox-backend-limitations.md).
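Several of the fallback causes above can be avoided through session configuration. The sketch below collects the relevant settings in one place; the keys are standard Spark configs, but the chosen values are illustrative assumptions, so confirm them against your Spark version and workload.

```python
# Sketch of session settings that address the fallback causes listed above.
# Keys are standard Spark configs; values are illustrative assumptions.
fallback_avoidance = {
    # Gluten does not support ANSI mode; keep it off so plans stay native.
    "spark.sql.ansi.enabled": "false",
    # CSV reads are offloaded only through DataSource V1 (see CSV read above).
    "spark.sql.sources.useV1SourceList": "csv",
    # Very large shuffle partition counts can still trigger OOM during spill;
    # a smaller count (200 is Spark's default) reduces that risk.
    "spark.sql.shuffle.partitions": "200",
}

# Apply per query or at session build time, e.g.:
# for key, value in fallback_avoidance.items():
#     spark.conf.set(key, value)
```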
Best practices for Apache Gluten
- Apache Gluten requires large off-heap memory because it integrates with the native backend (Velox) through Apache Arrow's off-heap columnar format, which is essential for large-scale data processing without exceeding JVM heap limits.
- Insufficient off-heap memory can lead to potential OOMs (see Limitations). Therefore, 75% of executor memory is allocated to off-heap by default.
- You can adjust the fraction of memory allocated to off-heap with the `ae.spark.offHeap.factor` configuration, which accepts values between 0 and 1 (exclusive), for example 0.75.
- Set a lower value (less than 0.5) if your workload has many fallbacks to JVM-based Spark (see Limitations), so that OOM does not occur during JVM-based Spark execution.
- Set a higher value (0.75 or above) if your workload executes natively on Apache Gluten without fallback. The default value is 0.75.
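The split described above can be sketched as a small helper. The function name is hypothetical; it only mirrors the arithmetic implied by `ae.spark.offHeap.factor`, including the rule that the factor must lie strictly between 0 and 1.

```python
def split_executor_memory(total_mb: int, offheap_factor: float = 0.75) -> tuple[int, int]:
    """Split executor memory into (off-heap, on-heap) portions in MiB.

    Hypothetical helper mirroring the ae.spark.offHeap.factor behavior
    described above: the factor must be strictly between 0 and 1, and
    defaults to 0.75 (75% of executor memory goes off-heap).
    """
    if not 0.0 < offheap_factor < 1.0:
        raise ValueError("offheap_factor must be in the exclusive range (0, 1)")
    offheap_mb = int(total_mb * offheap_factor)
    return offheap_mb, total_mb - offheap_mb

# With the 0.75 default, a 16 GiB executor gets 12 GiB off-heap:
# split_executor_memory(16384)  -> (12288, 4096)
```

A workload with frequent JVM fallbacks would call this with a factor below 0.5 to leave more memory for the on-heap side.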