Supporting dashboards
IBM® watsonx.data Presto (Java) offers observability through a set of dashboards that provide visibility into performance metrics, so that you can diagnose issues quickly and optimize resource allocation.
The following are the supported dashboards:
- System health
- Query performance health
- Data and metadata health
- Workload health
- Query latency health
- Query lifecycle health
- Anomaly and trend insights
- Log and error health
Grafana supports only four of these dashboards: System health, Query performance health, Data and metadata health, and Workload health. Instana supports all eight dashboards.
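The tool support described above can be written down as a small lookup, which may be convenient when scripting dashboard provisioning. This is only an illustrative sketch: the names `ALL_DASHBOARDS`, `SUPPORTED`, and `is_supported` are hypothetical and are not part of watsonx.data, Grafana, or Instana.

```python
# Dashboard names as listed in this topic.
ALL_DASHBOARDS = [
    "System health",
    "Query performance health",
    "Data and metadata health",
    "Workload health",
    "Query latency health",
    "Query lifecycle health",
    "Anomaly and trend insights",
    "Log and error health",
]

# Tool support as stated in this topic (hypothetical structure).
SUPPORTED = {
    "Grafana": set(ALL_DASHBOARDS[:4]),  # the first four dashboards only
    "Instana": set(ALL_DASHBOARDS),      # all eight dashboards
}


def is_supported(tool: str, dashboard: str) -> bool:
    """Return True if the named tool provides the named dashboard."""
    return dashboard in SUPPORTED.get(tool, set())
```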
The following sections list the default set of Presto (Java) metrics for each dashboard. You can extend this set by adding metrics as needed. For the full list of available metrics and their definitions, see Presto exposed JMX metrics.
System health
Monitoring the underlying infrastructure is paramount for Presto. This dashboard focuses on the foundational infrastructure, monitoring core resources such as CPU, memory, and I/O to detect bottlenecks and ensure stable operations.
Presto (Java) engine:
- CPU usage - Monitors CPU usage across Presto instances.
  process_cpu_seconds_total
- Memory usage - Tracks total memory used versus available.
  watsonx_data_presto_cluster_memory_manager_cluster_memory_bytes
  watsonx_data_presto_cluster_memory_manager_leaked_bytes
  watsonx_data_presto_memory_heap_memory_usage_committed_bytes
  watsonx_data_presto_memory_heap_memory_usage_max_bytes
  watsonx_data_presto_memory_non_heap_memory_usage_committed_bytes
  watsonx_data_presto_memory_non_heap_memory_usage_max_bytes
  watsonx_data_presto_cluster_memory_manager_cluster_user_memory_reservation
  watsonx_data_presto_cluster_memory_manager_cluster_total_memory_reservation
  watsonx_data_presto_cluster_memory_manager_queries_killed_due_to_out_of_memory
  jvm_memory_bytes_committed
- Presto memory pool - Tracks memory consumption within the reserved and general memory pools of Presto.
  watsonx_data_presto_memory_pool_general_max_bytes
  watsonx_data_presto_cluster_memory_pool_general_nodes
  watsonx_data_presto_memory_pool_general_free_bytes
  watsonx_data_presto_memory_pool_general_reserved_bytes
  watsonx_data_presto_cluster_memory_pool_general_free_distributed_bytes
  watsonx_data_presto_cluster_memory_pool_general_total_distributed_bytes
  watsonx_data_presto_cluster_memory_pool_general_reserved_distributed_bytes
  watsonx_data_presto_cluster_memory_pool_general_reserved_revocable_distributed_bytes
- Alluxio cache - Tracks the efficiency and usage of cached data in Alluxio during queries.
  watsonx_data_presto_alluxio_cache_bytes_read_cache_count
  watsonx_data_presto_alluxio_cache_bytes_requested_external_count
  watsonx_data_presto_alluxio_cache_written_cache_external_count
  watsonx_data_presto_alluxio_cache_get_errors_count
  watsonx_data_presto_alluxio_cache_put_errors_count
  watsonx_data_presto_alluxio_cache_pages_count
  watsonx_data_presto_alluxio_cache_pages_evicted_count
  watsonx_data_presto_alluxio_cache_space_available_value
  watsonx_data_presto_alluxio_cache_space_used_value
To generate Alluxio cache metrics, the Alluxio cache must be enabled. For more information, see Enhancing the query performance through caching.
Additionally, ensure the following configurations are included in the jvm.config file:
-Dalluxio.metrics.key.including.unique.id.enabled=true
-Dalluxio.user.app.id=presto
- Fragment cache - Tracks usage and hit or miss rates of cached query fragments in Presto.
  watsonx_data_presto_fragment_cache_stats_cache_entries
  watsonx_data_presto_fragment_cache_stats_cache_hit
  watsonx_data_presto_fragment_cache_stats_cache_removal
  watsonx_data_presto_fragment_cache_stats_cache_size_in_bytes
  watsonx_data_presto_fragment_cache_stats_inflight_bytes
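The Alluxio cache metrics listed above are generated only when the two jvm.config options are present. The following minimal sketch checks the text of a jvm.config file for those options. It assumes that each option appears on its own line exactly as written in this topic; the function name `missing_alluxio_flags` and the sample file content are hypothetical.

```python
from typing import List

# The two options named in this topic.
REQUIRED_FLAGS = (
    "-Dalluxio.metrics.key.including.unique.id.enabled=true",
    "-Dalluxio.user.app.id=presto",
)


def missing_alluxio_flags(jvm_config_text: str) -> List[str]:
    """Return the required options that do not appear as a line in the text."""
    present = {line.strip() for line in jvm_config_text.splitlines()}
    return [flag for flag in REQUIRED_FLAGS if flag not in present]


# Hypothetical jvm.config content used only for illustration.
sample = """-server
-Xmx16G
-Dalluxio.user.app.id=presto
"""
print(missing_alluxio_flags(sample))
# prints ['-Dalluxio.metrics.key.including.unique.id.enabled=true']
```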
Query performance health
Understanding query behavior is critical for a query engine. Query performance metrics include:
Presto (Java) engine:
- Currently running queries - Tracks the number of queries that are currently running, as an indicator of query load.
  watsonx_data_presto_query_manager_running_queries
- Query execution time - Tracks query latency as the 99th percentile of execution time over a five-minute window.
  watsonx_data_presto_query_manager_execution_time_five_minutes_p99
- Data processed - Measures data transfer rates during query execution.
  watsonx_data_presto_task_manager_input_data_size_five_minute_count
  watsonx_data_presto_task_manager_output_data_size_five_minute_count
- Error rates - Indicates the percentage of queries that error out when the system is under stress.
  watsonx_data_presto_query_manager_user_error_failures_five_minute_count
  watsonx_data_presto_query_manager_abandoned_queries_five_minute_count
  watsonx_data_presto_query_manager_canceled_queries_five_minute_count
- Successful vs failed requests - Tracks successful vs failed request counts.
  watsonx_data_presto_query_manager_completed_queries_five_minute_count
  watsonx_data_presto_query_manager_failed_queries_five_minute_count
  watsonx_data_presto_query_manager_internal_failures_five_minute_count
  watsonx_data_presto_task_manager_failed_tasks_five_minute_count
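The metrics above are counts, while error rates are usually read as a percentage. One way to derive a percentage from two of the five-minute counts is sketched below. It assumes that the completed-queries count includes failed queries, which this topic does not confirm; if it does not, divide by the sum of both counts instead. The function name `failure_percentage` is hypothetical.

```python
def failure_percentage(failed: float, completed: float) -> float:
    """Percentage of completed queries that failed in the same window.

    Assumption: `completed` already includes the failed queries.
    Returns 0.0 when no queries completed, to avoid dividing by zero.
    """
    if completed <= 0:
        return 0.0
    return 100.0 * failed / completed


# Illustrative values, for example sampled from
# ..._failed_queries_five_minute_count and ..._completed_queries_five_minute_count.
print(failure_percentage(failed=5, completed=200))  # prints 2.5
```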
Data and metadata health
For a system dealing with vast amounts of data, the health of data ingestion and metadata management is crucial.
Presto (Java) engine:
- Data ingestion - Query manager - Monitors the volume and rate of data being ingested into the system.
  watsonx_data_presto_query_manager_consumed_input_bytes_five_minute_count
  watsonx_data_presto_query_manager_consumed_input_rows_five_minute_count
  watsonx_data_presto_query_manager_wall_input_bytes_rate_five_minutes_p90
- S3 Object store errors - Tracks failure metrics while reading data from S3 or object storage layers.
  watsonx_data_presto_hive_s3_presto_s3_file_system_get_metadata_errors_total_count
  watsonx_data_presto_hive_s3_presto_s3_file_system_failed_uploads_total_count
  watsonx_data_presto_hive_s3_presto_s3_file_system_other_read_errors_total_count
- Queue metric - Measures the size and processing rate of internal data processing queues.
  watsonx_data_presto_dispatch_manager_queued_queries
  watsonx_data_presto_split_scheduler_stats_mixed_split_queues_full_and_waiting_for_source_five_minute_count
  watsonx_data_presto_task_executor_processor_executor_queued_task_count
  watsonx_data_presto_task_executor_split_queued_time_all_time_max
  watsonx_data_presto_task_executor_split_queued_time_all_time_avg
- File metadata cache metrics - Observes hit/miss rates and efficiency of the metadata cache for file access.
  watsonx_data_presto_hive_cache_stats_mbean_parquet_metadata_hit_rate
  watsonx_data_presto_hive_cache_stats_mbean_parquet_metadata_size
  watsonx_data_presto_hive_cache_stats_mbean_orc_file_tail_size
  watsonx_data_presto_hive_cache_stats_mbean_orc_file_tail_hit_rate
  watsonx_data_presto_hive_cache_stats_mbean_stripe_footer_size
  watsonx_data_presto_hive_cache_stats_mbean_stripe_stream_size
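To inspect metrics such as the metadata cache hit rates outside a dashboard, the scraped samples can be parsed directly. The sketch below assumes that the metrics are available in the Prometheus text exposition format, which this topic does not state. The sample text, the `instance` label, the threshold, and the function names are all hypothetical.

```python
from typing import Dict, List


def parse_samples(text: str) -> Dict[str, List[float]]:
    """Collect sample values per metric name from Prometheus-style text."""
    samples: Dict[str, List[float]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        if "{" in line:
            # name{labels} value [timestamp]
            name = line[: line.index("{")]
            value = line[line.rindex("}") + 1:].split()[0]
        else:
            # name value [timestamp]
            name, value = line.split()[:2]
        samples.setdefault(name, []).append(float(value))
    return samples


def low_hit_rates(samples: Dict[str, List[float]], threshold: float) -> List[str]:
    """Names of *_hit_rate metrics with at least one sample below the threshold."""
    return sorted(
        name
        for name, values in samples.items()
        if name.endswith("_hit_rate") and min(values) < threshold
    )


# Invented sample values, for illustration only.
SCRAPE = """\
# TYPE watsonx_data_presto_hive_cache_stats_mbean_parquet_metadata_hit_rate gauge
watsonx_data_presto_hive_cache_stats_mbean_parquet_metadata_hit_rate 0.92
watsonx_data_presto_hive_cache_stats_mbean_orc_file_tail_hit_rate{instance="worker-1"} 0.4
"""
print(low_hit_rates(parse_samples(SCRAPE), threshold=0.5))
```

A threshold of 0.5 is an arbitrary choice for the example; a suitable value depends on the workload.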
Workload health
Understanding how different workloads interact with the system is key to resource optimization.
Presto (Java) engine:
- Workload count - Indicates the number of currently running queries.
  watsonx_data_presto_query_manager_running_queries
- Status - Indicates whether the workload is active, idle, or failed.
  watsonx_data_presto_query_manager_completed_queries_five_minute_count
  watsonx_data_presto_query_manager_abandoned_queries_five_minute_count
  watsonx_data_presto_query_manager_canceled_queries_five_minute_count
  watsonx_data_presto_query_manager_failed_queries_five_minute_count
- Error rates - Tracks query failures by cause: user errors, external failures, internal failures, and insufficient resources.
  watsonx_data_presto_query_manager_user_error_failures_five_minute_count
  watsonx_data_presto_query_manager_failed_queries_five_minute_count
  watsonx_data_presto_query_manager_external_failures_five_minute_count
  watsonx_data_presto_query_manager_internal_failures_five_minute_count
  watsonx_data_presto_query_manager_insufficient_resources_failures_five_minute_count
- Resource utilization - Tracks CPU, memory, and disk usage associated with each workload.
  watsonx_data_presto_query_manager_consumed_cpu_time_seconds_five_minute_count
  watsonx_data_presto_query_manager_cpu_input_byte_rate_five_minutes_p25
  watsonx_data_presto_query_manager_cpu_input_byte_rate_five_minutes_p50
  watsonx_data_presto_query_manager_cpu_input_byte_rate_five_minutes_p75
  watsonx_data_presto_query_manager_cpu_input_byte_rate_five_minutes_p90
- Request count - Total number of workload execution requests received over a period of time.
  watsonx_data_presto_query_manager_running_queries
  watsonx_data_presto_dispatch_manager_queued_queries
Query lifecycle health
Provides insight into each stage of a query’s journey, from submission to execution, helping to identify bottlenecks in queuing, task execution, and completion.
Presto (Java) engine:
- Errors in each stage - Tracks task failures across the stages of query execution on the Presto instances, identifying the areas of the pipeline where tasks are failing.
  watsonx_data_presto_task_manager_failed_tasks_five_minute_count
- Resource utilization per query - Captures system resource usage per query, including threads, splits, and queued or executing queries.
  watsonx_data_presto_task_executor_running_tasks_level0
  watsonx_data_presto_task_executor_running_splits
  watsonx_data_presto_query_manager_submitted_queries_five_minute_count
  watsonx_data_presto_query_manager_queued_queries
  watsonx_data_presto_dispatch_manager_queued_queries
  watsonx_data_presto_task_executor_blocked_splits
- Executor pool health - Monitors the internal thread pool used to run Presto tasks.
  watsonx_data_presto_task_executor_processor_executor_pool_size
  watsonx_data_presto_task_executor_processor_executor_active_count
  watsonx_data_presto_task_executor_processor_executor_completed_task_count
  watsonx_data_presto_task_executor_processor_executor_queued_task_count
- Split CPU time - Tracks CPU time consumed by leaf and intermediate splits.
  watsonx_data_presto_task_executor_intermediate_split_cpu_time_count
  watsonx_data_presto_query_manager_consumed_cpu_time_seconds_five_minute_count
  watsonx_data_presto_task_executor_leaf_split_cpu_time_p99
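The executor pool metrics listed above can be combined into a simple saturation indicator. The following sketch is one possible reading, not a documented formula: it treats the pool as saturated when most threads are active and tasks are waiting in the queue. The function names and the 0.9 threshold are hypothetical.

```python
def pool_utilization(active_count: float, pool_size: float) -> float:
    """Fraction of executor threads that are active (0.0 if the pool is empty)."""
    if pool_size <= 0:
        return 0.0
    return active_count / pool_size


def is_saturated(
    active_count: float,
    pool_size: float,
    queued_task_count: float,
    threshold: float = 0.9,  # hypothetical cut-off, tune for your cluster
) -> bool:
    """True when utilization reaches the threshold and tasks are queued."""
    busy = pool_utilization(active_count, pool_size) >= threshold
    return busy and queued_task_count > 0


# Illustrative values, for example sampled from
# ..._processor_executor_active_count, ..._pool_size, and ..._queued_task_count.
print(is_saturated(active_count=45, pool_size=50, queued_task_count=3))  # prints True
```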
Query latency health
Focuses on the execution phase of queries, identifying sources of latency and the impact of query complexity.
Presto (Java) engine:
- Latency (ms) - Measures execution time across various stages.
  watsonx_data_presto_query_manager_execution_time_five_minutes_p99
  watsonx_data_presto_task_executor_split_wall_time_one_minute_max
  watsonx_data_presto_task_executor_split_wall_time_fifteen_minutes_max
  watsonx_data_presto_task_executor_split_wall_time_all_time_p99
  watsonx_data_presto_task_executor_leaf_split_wall_time_p99
- Request volume - Tracks task, split, and scheduling activity.
  watsonx_data_presto_task_executor_split_queued_time_five_minutes_count
  watsonx_data_presto_split_scheduler_stats_get_split_time_five_minutes_p99
  watsonx_data_presto_task_executor_split_wall_time_five_minutes_count
Log and error health
Monitors errors and failures across the query execution pipeline, highlighting system stability and failure patterns.
Presto (Java) engine:
- Query failure rate - Tracks execution failures and bottlenecks.
  watsonx_data_presto_query_manager_failed_queries_five_minute_count
  watsonx_data_presto_query_manager_internal_failures_five_minute_count
  watsonx_data_presto_query_manager_user_error_failures_five_minute_count
  watsonx_data_presto_task_executor_split_skipped_due_to_memory_pressure_five_minute_count
- Service/component affected - Hive S3 / FileSystem - Identifies failing Presto components.
  watsonx_data_presto_hive_s3_presto_s3_file_system_failed_uploads_total_count
  watsonx_data_presto_hive_s3_presto_s3_file_system_aws_retry_count_fifteen_minute_count
  watsonx_data_presto_hive_s3_presto_s3_file_system_get_metadata_errors_total_count
  watsonx_data_presto_hive_s3_presto_s3_file_system_socket_timeout_exceptions_total_count
- Service/component affected - Task Executor - Identifies failing Presto components.
  watsonx_data_presto_task_executor_split_wall_time_all_time_max_error
  watsonx_data_presto_task_executor_blocked_quanta_wall_time_all_time_max_error
  watsonx_data_presto_task_executor_leaf_split_cpu_time_max_error
  watsonx_data_presto_task_executor_intermediate_split_wall_time_max_error
  watsonx_data_presto_task_executor_unblocked_quanta_wall_time_one_minute_max_error
  watsonx_data_presto_task_executor_split_queued_time_one_minute_max_error
  watsonx_data_presto_hive_s3_presto_s3_file_system_failed_uploads_total_count
  watsonx_data_presto_hive_s3_presto_s3_file_system_aws_retry_count_fifteen_minute_count
  watsonx_data_presto_hive_s3_presto_s3_file_system_get_metadata_errors_total_count
  watsonx_data_presto_hive_s3_presto_s3_file_system_socket_timeout_exceptions_total_count
- Severity level - Categorizes metrics by impact severity.
  - Severe (Critical)
    watsonx_data_presto_query_manager_internal_failures_five_minute_count
    watsonx_data_presto_task_executor_split_wall_time_all_time_max_error
    watsonx_data_presto_task_executor_blocked_quanta_wall_time_all_time_max_error
    watsonx_data_presto_hive_s3_presto_s3_file_system_failed_uploads_total_count
    watsonx_data_presto_cache_stats_quota_exceeded
  - Moderate (Warning)
    watsonx_data_presto_query_manager_user_error_failures_five_minute_count
    watsonx_data_presto_hive_s3_presto_s3_file_system_aws_retry_count_fifteen_minute_rate
    watsonx_data_presto_hive_s3_presto_s3_file_system_get_object_errors_fifteen_minute_rate
    watsonx_data_presto_hive_s3_presto_s3_file_system_read_retries_fifteen_minute_rate
  - Low (Info)
    watsonx_data_presto_hive_s3_presto_s3_file_system_get_metadata_retries_five_minute_count
    watsonx_data_presto_task_executor_split_skipped_due_to_memory_pressure_total_count
    watsonx_data_presto_task_executor_processor_executor_shutdown
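The severity grouping above can be expressed as a lookup table, for example to label alerts that are raised on these metrics. This is a minimal sketch: the metric names are taken from the severity list in this topic, while the `SEVERITY` structure and the `severity_of` helper are hypothetical.

```python
from typing import Optional

# Metric names per severity level, as listed in this topic.
SEVERITY = {
    "severe": [
        "watsonx_data_presto_query_manager_internal_failures_five_minute_count",
        "watsonx_data_presto_task_executor_split_wall_time_all_time_max_error",
        "watsonx_data_presto_task_executor_blocked_quanta_wall_time_all_time_max_error",
        "watsonx_data_presto_hive_s3_presto_s3_file_system_failed_uploads_total_count",
        "watsonx_data_presto_cache_stats_quota_exceeded",
    ],
    "moderate": [
        "watsonx_data_presto_query_manager_user_error_failures_five_minute_count",
        "watsonx_data_presto_hive_s3_presto_s3_file_system_aws_retry_count_fifteen_minute_rate",
        "watsonx_data_presto_hive_s3_presto_s3_file_system_get_object_errors_fifteen_minute_rate",
        "watsonx_data_presto_hive_s3_presto_s3_file_system_read_retries_fifteen_minute_rate",
    ],
    "low": [
        "watsonx_data_presto_hive_s3_presto_s3_file_system_get_metadata_retries_five_minute_count",
        "watsonx_data_presto_task_executor_split_skipped_due_to_memory_pressure_total_count",
        "watsonx_data_presto_task_executor_processor_executor_shutdown",
    ],
}


def severity_of(metric: str) -> Optional[str]:
    """Return the severity level for a metric, or None if it is not listed."""
    for level, names in SEVERITY.items():
        if metric in names:
            return level
    return None
```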
Anomaly and trend insights
Highlights unexpected patterns or deviations in query behavior, helping detect performance degradation or improvements.
Presto (Java) engine:
- Latency drift - Tracks evolving query latencies.
  watsonx_data_presto_task_executor_split_wall_time_fifteen_minutes_avg
  watsonx_data_presto_task_executor_leaf_split_wait_time_avg
  watsonx_data_presto_task_executor_intermediate_split_wall_time_avg
- Error rate vs baseline - Compares recent execution metrics to historical baselines.
  watsonx_data_presto_task_executor_split_wall_time_fifteen_minutes_avg
  watsonx_data_presto_task_executor_leaf_split_wait_time_avg
  watsonx_data_presto_task_executor_intermediate_split_wall_time_avg
- Throughput drop detector - Detects dips in data processing rates.
  watsonx_data_presto_task_executor_global_scheduled_time_micros_five_minute_rate
  watsonx_data_presto_hive_s3_presto_s3_file_system_successful_uploads_five_minute_rate
- Query duration - Measures average execution time per task or split.
  watsonx_data_presto_task_executor_leaf_split_cpu_time_avg
  watsonx_data_presto_task_executor_intermediate_split_cpu_time_avg
- Memory trend - Tracks memory usage and potential leaks.
  jvm_memory_bytes_used
  watsonx_data_presto_memory_heap_memory_usage_used_bytes
- Workload trend comparison - Compares resource usage across time windows.
  watsonx_data_presto_task_executor_global_cpu_time_micros_total_count
  watsonx_data_presto_cluster_memory_manager_cluster_total_memory_reservation
  watsonx_data_presto_task_executor_blocked_quanta_wall_time_fifteen_minutes_avg
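Comparing recent values to a historical baseline, as the latency drift and baseline items above describe, can be sketched as a relative change between two windows of samples. This topic does not describe how the dashboards compute drift, so the approach below (difference of means, with a hypothetical 20% tolerance) is an assumption for illustration only, as are the function names.

```python
from statistics import mean
from typing import Sequence


def relative_drift(baseline: Sequence[float], recent: Sequence[float]) -> float:
    """Relative change of the recent mean against the baseline mean.

    For example, 0.5 means that the recent window is 50% above the baseline.
    """
    if not baseline or not recent:
        raise ValueError("both windows need at least one sample")
    base = mean(baseline)
    if base == 0:
        return 0.0  # no meaningful ratio against a zero baseline
    return (mean(recent) - base) / base


def has_drifted(
    baseline: Sequence[float],
    recent: Sequence[float],
    tolerance: float = 0.2,  # hypothetical threshold
) -> bool:
    """True when the recent window deviates from the baseline beyond the tolerance."""
    return abs(relative_drift(baseline, recent)) > tolerance


# Illustrative samples, for example of
# watsonx_data_presto_task_executor_split_wall_time_fifteen_minutes_avg.
print(relative_drift([100, 100, 100], [150, 150]))  # prints 0.5
```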