Supporting dashboards

IBM® watsonx.data Presto (Java) offers comprehensive observability through a set of dashboards that provide visibility into performance metrics, enabling rapid issue diagnosis and better resource allocation.

The following are the supported dashboards:

  • System health
  • Query performance health
  • Data and metadata health
  • Workload health
  • Query latency health
  • Query lifecycle health
  • Anomaly and trend insights
  • Log and error health

The Grafana tool supports only the following four dashboards: System health, Query performance health, Data and metadata health, and Workload health. The Instana tool supports all eight dashboards.

The following lists represent the default set of Presto (Java) metrics for each dashboard. Users can extend this set by adding metrics as needed. A full list of available metrics and their definitions can be found in Presto exposed JMX metrics.

System health

Monitoring the underlying infrastructure is paramount for Presto. This dashboard focuses on the foundational infrastructure, monitoring core resources such as CPU, memory, and I/O to detect bottlenecks and ensure stable operations.

Presto (Java) engine:

  • CPU usage - Monitors CPU usage across Presto instances.

    • process_cpu_seconds_total
  • Memory usage - Tracks total memory used versus available.

    • watsonx_data_presto_cluster_memory_manager_cluster_memory_bytes
    • watsonx_data_presto_cluster_memory_manager_leaked_bytes
    • watsonx_data_presto_memory_heap_memory_usage_committed_bytes
    • watsonx_data_presto_memory_heap_memory_usage_max_bytes
    • watsonx_data_presto_memory_non_heap_memory_usage_committed_bytes
    • watsonx_data_presto_memory_non_heap_memory_usage_max_bytes
    • watsonx_data_presto_cluster_memory_manager_cluster_user_memory_reservation
    • watsonx_data_presto_cluster_memory_manager_cluster_total_memory_reservation
    • watsonx_data_presto_cluster_memory_manager_queries_killed_due_to_out_of_memory
    • jvm_memory_bytes_committed
  • Presto memory pool - Tracks memory consumption within the reserved and general memory pools of Presto.

    • watsonx_data_presto_memory_pool_general_max_bytes
    • watsonx_data_presto_cluster_memory_pool_general_nodes
    • watsonx_data_presto_memory_pool_general_free_bytes
    • watsonx_data_presto_memory_pool_general_reserved_bytes
    • watsonx_data_presto_cluster_memory_pool_general_free_distributed_bytes
    • watsonx_data_presto_cluster_memory_pool_general_total_distributed_bytes
    • watsonx_data_presto_cluster_memory_pool_general_reserved_distributed_bytes
    • watsonx_data_presto_cluster_memory_pool_general_reserved_revocable_distributed_bytes
  • Alluxio cache - Tracks the efficiency and usage of cached data in Alluxio during queries.

    • watsonx_data_presto_alluxio_cache_bytes_read_cache_count
    • watsonx_data_presto_alluxio_cache_bytes_requested_external_count
    • watsonx_data_presto_alluxio_cache_written_cache_external_count
    • watsonx_data_presto_alluxio_cache_get_errors_count
    • watsonx_data_presto_alluxio_cache_put_errors_count
    • watsonx_data_presto_alluxio_cache_pages_count
    • watsonx_data_presto_alluxio_cache_pages_evicted_count
    • watsonx_data_presto_alluxio_cache_space_available_value
    • watsonx_data_presto_alluxio_cache_space_used_value

To generate Alluxio cache metrics, the Alluxio cache must be enabled. For more information, refer to Enhancing the query performance through caching.

Additionally, ensure that the following two configurations are included in the jvm.config file: -Dalluxio.metrics.key.including.unique.id.enabled=true and -Dalluxio.user.app.id=presto
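As a minimal sketch, the presence of the two flags named above can be checked programmatically. The helper name missing_alluxio_flags and the sample file content are hypothetical and used only for illustration; only the two -D flags come from this page. The sketch assumes that jvm.config holds one JVM option per line.

```python
# Flags that this page lists as required for Alluxio cache metrics.
REQUIRED_FLAGS = (
    "-Dalluxio.metrics.key.including.unique.id.enabled=true",
    "-Dalluxio.user.app.id=presto",
)


def missing_alluxio_flags(jvm_config_text):
    """Return the required flags that are absent from the given jvm.config text."""
    # Assumption: one JVM option per line; blank lines and surrounding
    # whitespace are ignored.
    present = {line.strip() for line in jvm_config_text.splitlines() if line.strip()}
    return [flag for flag in REQUIRED_FLAGS if flag not in present]


# Hypothetical jvm.config content, for illustration only.
sample = """-server
-Xmx16G
-Dalluxio.user.app.id=presto
"""

print(missing_alluxio_flags(sample))
```

An empty list means both flags are present. In the sample, only the app ID flag is set, so the unique ID flag is reported as missing.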

  • Fragment cache - Tracks usage and hit or miss rates of cached query fragments in Presto.
    • watsonx_data_presto_fragment_cache_stats_cache_entries
    • watsonx_data_presto_fragment_cache_stats_cache_hit
    • watsonx_data_presto_fragment_cache_stats_cache_removal
    • watsonx_data_presto_fragment_cache_stats_cache_size_in_bytes
    • watsonx_data_presto_fragment_cache_stats_inflight_bytes
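The CPU usage metric listed earlier in this section, process_cpu_seconds_total, is conventionally a cumulative counter of CPU seconds, so a dashboard panel usually shows its rate of change rather than the raw value. The following is a minimal sketch of that calculation from two samples. The function name, the sample values, and the cpu_cores parameter are assumptions for illustration, not part of the product.

```python
def cpu_utilization(prev_cpu_seconds, curr_cpu_seconds, interval_seconds, cpu_cores=1):
    """Approximate average CPU utilization per core between two counter samples."""
    if interval_seconds <= 0 or cpu_cores <= 0:
        raise ValueError("interval_seconds and cpu_cores must be positive")
    delta = curr_cpu_seconds - prev_cpu_seconds
    if delta < 0:
        # Assumption: a decreasing counter means the process restarted,
        # so the current value is taken as the CPU time since the restart.
        delta = curr_cpu_seconds
    return delta / (interval_seconds * cpu_cores)


# Hypothetical samples taken 60 seconds apart on a 2-core instance.
print(cpu_utilization(1200.0, 1230.0, 60, cpu_cores=2))  # 0.25
```

A result of 0.25 means that, on average, each core was busy a quarter of the time during the interval.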

Query performance health

Understanding query behavior is critical for a query engine. Query performance metrics include:

Presto (Java) engine:

  • Currently running queries - Monitors the query request rate.

    • watsonx_data_presto_query_manager_running_queries
  • Query execution time - Tracks query latency.

    • watsonx_data_presto_query_manager_execution_time_five_minutes_p99
  • Data processed - Measures data transfer rates during query execution.

    • watsonx_data_presto_task_manager_input_data_size_five_minute_count
    • watsonx_data_presto_task_manager_output_data_size_five_minute_count
  • Error rates - Indicates the percentage of queries that error out when the system is under stress.

    • watsonx_data_presto_query_manager_user_error_failures_five_minute_count
    • watsonx_data_presto_query_manager_abandoned_queries_five_minute_count
    • watsonx_data_presto_query_manager_canceled_queries_five_minute_count
  • Successful vs failed requests - Tracks successful vs failed request counts.

    • watsonx_data_presto_query_manager_completed_queries_five_minute_count
    • watsonx_data_presto_query_manager_failed_queries_five_minute_count
    • watsonx_data_presto_query_manager_internal_failures_five_minute_count
    • watsonx_data_presto_task_manager_failed_tasks_five_minute_count

Data and metadata health

For a system dealing with vast amounts of data, the health of data ingestion and metadata management is crucial.

Presto (Java) engine:

  • Data ingestion - Query manager - Monitors the volume and rate of data being ingested into the system.

    • watsonx_data_presto_query_manager_consumed_input_bytes_five_minute_count
    • watsonx_data_presto_query_manager_consumed_input_rows_five_minute_count
    • watsonx_data_presto_query_manager_wall_input_bytes_rate_five_minutes_p90
  • S3 Object store errors - Tracks failure metrics while reading data from S3 or object storage layers.

    • watsonx_data_presto_hive_s3_presto_s3_file_system_get_metadata_errors_total_count
    • watsonx_data_presto_hive_s3_presto_s3_file_system_failed_uploads_total_count
    • watsonx_data_presto_hive_s3_presto_s3_file_system_other_read_errors_total_count
  • Queue metric - Measures the size and processing rate of internal data processing queues.

    • watsonx_data_presto_dispatch_manager_queued_queries
    • watsonx_data_presto_split_scheduler_stats_mixed_split_queues_full_and_waiting_for_source_five_minute_count
    • watsonx_data_presto_task_executor_processor_executor_queued_task_count
    • watsonx_data_presto_task_executor_split_queued_time_all_time_max
    • watsonx_data_presto_task_executor_split_queued_time_all_time_avg
  • File metadata cache metrics - Observes hit/miss rates and efficiency of the metadata cache for file access.

    • watsonx_data_presto_hive_cache_stats_mbean_parquet_metadata_hit_rate
    • watsonx_data_presto_hive_cache_stats_mbean_parquet_metadata_size
    • watsonx_data_presto_hive_cache_stats_mbean_orc_file_tail_size
    • watsonx_data_presto_hive_cache_stats_mbean_orc_file_tail_hit_rate
    • watsonx_data_presto_hive_cache_stats_mbean_stripe_footer_size
    • watsonx_data_presto_hive_cache_stats_mbean_stripe_stream_size

Workload health

Understanding how different workloads interact with the system is key to resource optimization.

Presto (Java) engine:

  • Workload count - Indicates the number of currently running queries.

    • watsonx_data_presto_query_manager_running_queries
  • Status - Indicates whether the workload is active, idle, or failed.

    • watsonx_data_presto_query_manager_completed_queries_five_minute_count
    • watsonx_data_presto_query_manager_abandoned_queries_five_minute_count
    • watsonx_data_presto_query_manager_canceled_queries_five_minute_count
    • watsonx_data_presto_query_manager_failed_queries_five_minute_count
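  • Error rates - Tracks query failures by cause, including user errors, external failures, internal failures, and insufficient resources.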

    • watsonx_data_presto_query_manager_user_error_failures_five_minute_count
    • watsonx_data_presto_query_manager_failed_queries_five_minute_count
    • watsonx_data_presto_query_manager_external_failures_five_minute_count
    • watsonx_data_presto_query_manager_internal_failures_five_minute_count
    • watsonx_data_presto_query_manager_insufficient_resources_failures_five_minute_count
  • Resource utilization - Tracks CPU, memory, and disk usage associated with each workload.

    • watsonx_data_presto_query_manager_consumed_cpu_time_seconds_five_minute_count
    • watsonx_data_presto_query_manager_cpu_input_byte_rate_five_minutes_p25
    • watsonx_data_presto_query_manager_cpu_input_byte_rate_five_minutes_p50
    • watsonx_data_presto_query_manager_cpu_input_byte_rate_five_minutes_p75
    • watsonx_data_presto_query_manager_cpu_input_byte_rate_five_minutes_p90
  • Request count - Total number of workload execution requests received over a period of time.

    • watsonx_data_presto_query_manager_running_queries
    • watsonx_data_presto_dispatch_manager_queued_queries

Query lifecycle health

Provides insight into each stage of a query's journey, from submission to execution, helping to identify bottlenecks in queuing, task execution, and completion.

Presto (Java) engine:

  • Errors in each stage - Tracks task failures across the Presto instances to identify the areas of the pipeline where query execution is failing.

    • watsonx_data_presto_task_manager_failed_tasks_five_minute_count
  • Resource utilization per query - Captures system resource usage per query, including threads, splits, and queued or executing queries.

    • watsonx_data_presto_task_executor_running_tasks_level0
    • watsonx_data_presto_task_executor_running_splits
    • watsonx_data_presto_query_manager_submitted_queries_five_minute_count
    • watsonx_data_presto_query_manager_queued_queries
    • watsonx_data_presto_dispatch_manager_queued_queries
    • watsonx_data_presto_task_executor_blocked_splits
  • Executor pool health - Monitors the internal thread pool used to run Presto tasks.

    • watsonx_data_presto_task_executor_processor_executor_pool_size
    • watsonx_data_presto_task_executor_processor_executor_active_count
    • watsonx_data_presto_task_executor_processor_executor_completed_task_count
    • watsonx_data_presto_task_executor_processor_executor_queued_task_count
  • Split CPU time - Tracks CPU time consumed by leaf and intermediate splits.

    • watsonx_data_presto_task_executor_intermediate_split_cpu_time_count
    • watsonx_data_presto_query_manager_consumed_cpu_time_seconds_five_minute_count
    • watsonx_data_presto_task_executor_leaf_split_cpu_time_p99
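One way to read the executor pool metrics above together is to compare the active thread count with the pool size and check whether work is waiting. The following sketch assumes that pool_size, active_count, and queued_task_count have the usual thread pool meanings. The function name and the saturation rule are illustrative choices, not a documented product behavior.

```python
def executor_pool_status(pool_size, active_count, queued_task_count):
    """Summarize executor pool pressure from three gauge values."""
    utilization = active_count / pool_size if pool_size > 0 else 0.0
    # Assumption: the pool is treated as saturated when every thread is busy
    # and at least one task is waiting in the queue.
    saturated = pool_size > 0 and active_count >= pool_size and queued_task_count > 0
    return {"utilization": utilization, "saturated": saturated}


# Hypothetical gauge readings.
print(executor_pool_status(32, 32, 5))  # {'utilization': 1.0, 'saturated': True}
print(executor_pool_status(32, 8, 0))   # {'utilization': 0.25, 'saturated': False}
```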

Query latency health

Focuses on the execution phase of queries, identifying latency sources and the impact of query complexity.

Presto (Java) engine:

  • Latency (ms) - Measures execution time across various stages.

    • watsonx_data_presto_query_manager_execution_time_five_minutes_p99
    • watsonx_data_presto_task_executor_split_wall_time_one_minute_max
    • watsonx_data_presto_task_executor_split_wall_time_fifteen_minutes_max
    • watsonx_data_presto_task_executor_split_wall_time_all_time_p99
    • watsonx_data_presto_task_executor_leaf_split_wall_time_p99
  • Request volume - Tracks task, split, and scheduling activity.

    • watsonx_data_presto_task_executor_split_queued_time_five_minutes_count
    • watsonx_data_presto_split_scheduler_stats_get_split_time_five_minutes_p99
    • watsonx_data_presto_task_executor_split_wall_time_five_minutes_count

Log and error health

Monitors errors and failures across the query execution pipeline, highlighting system stability and failure patterns.

Presto (Java) engine:

  • Query failure rate - Tracks execution failures and bottlenecks.

    • watsonx_data_presto_query_manager_failed_queries_five_minute_count
    • watsonx_data_presto_query_manager_internal_failures_five_minute_count
    • watsonx_data_presto_query_manager_user_error_failures_five_minute_count
    • watsonx_data_presto_task_executor_split_skipped_due_to_memory_pressure_five_minute_count
  • Service/component affected - Hive S3 / FileSystem - Identifies failing Presto components.

    • watsonx_data_presto_hive_s3_presto_s3_file_system_failed_uploads_total_count
    • watsonx_data_presto_hive_s3_presto_s3_file_system_aws_retry_count_fifteen_minute_count
    • watsonx_data_presto_hive_s3_presto_s3_file_system_get_metadata_errors_total_count
    • watsonx_data_presto_hive_s3_presto_s3_file_system_socket_timeout_exceptions_total_count
  • Service/component affected - Task Executor - Identifies failing Presto components.

    • watsonx_data_presto_task_executor_split_wall_time_all_time_max_error
    • watsonx_data_presto_task_executor_blocked_quanta_wall_time_all_time_max_error
    • watsonx_data_presto_task_executor_leaf_split_cpu_time_max_error
    • watsonx_data_presto_task_executor_intermediate_split_wall_time_max_error
    • watsonx_data_presto_task_executor_unblocked_quanta_wall_time_one_minute_max_error
    • watsonx_data_presto_task_executor_split_queued_time_one_minute_max_error
    • watsonx_data_presto_hive_s3_presto_s3_file_system_failed_uploads_total_count
    • watsonx_data_presto_hive_s3_presto_s3_file_system_aws_retry_count_fifteen_minute_count
    • watsonx_data_presto_hive_s3_presto_s3_file_system_get_metadata_errors_total_count
    • watsonx_data_presto_hive_s3_presto_s3_file_system_socket_timeout_exceptions_total_count
  • Severity level - Categorizes metrics by impact severity.

    • Severe (Critical)

      • watsonx_data_presto_query_manager_internal_failures_five_minute_count
      • watsonx_data_presto_task_executor_split_wall_time_all_time_max_error
      • watsonx_data_presto_task_executor_blocked_quanta_wall_time_all_time_max_error
      • watsonx_data_presto_hive_s3_presto_s3_file_system_failed_uploads_total_count
      • watsonx_data_presto_cache_stats_quota_exceeded
    • Moderate (Warning)

      • watsonx_data_presto_query_manager_user_error_failures_five_minute_count
      • watsonx_data_presto_hive_s3_presto_s3_file_system_aws_retry_count_fifteen_minute_rate
      • watsonx_data_presto_hive_s3_presto_s3_file_system_get_object_errors_fifteen_minute_rate
      • watsonx_data_presto_hive_s3_presto_s3_file_system_read_retries_fifteen_minute_rate
    • Low (Info)

      • watsonx_data_presto_hive_s3_presto_s3_file_system_get_metadata_retries_five_minute_count
      • watsonx_data_presto_task_executor_split_skipped_due_to_memory_pressure_total_count
      • watsonx_data_presto_task_executor_processor_executor_shutdown
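The severity grouping above can be expressed as a simple lookup, for example when routing alerts. This is a sketch only: the metric names and levels are copied from the list above, while the dictionary, the function name, and the "Unclassified" fallback are hypothetical.

```python
# Severity levels as listed on this page.
SEVERITY_BY_METRIC = {
    "watsonx_data_presto_query_manager_internal_failures_five_minute_count": "Severe",
    "watsonx_data_presto_task_executor_split_wall_time_all_time_max_error": "Severe",
    "watsonx_data_presto_task_executor_blocked_quanta_wall_time_all_time_max_error": "Severe",
    "watsonx_data_presto_hive_s3_presto_s3_file_system_failed_uploads_total_count": "Severe",
    "watsonx_data_presto_cache_stats_quota_exceeded": "Severe",
    "watsonx_data_presto_query_manager_user_error_failures_five_minute_count": "Moderate",
    "watsonx_data_presto_hive_s3_presto_s3_file_system_aws_retry_count_fifteen_minute_rate": "Moderate",
    "watsonx_data_presto_hive_s3_presto_s3_file_system_get_object_errors_fifteen_minute_rate": "Moderate",
    "watsonx_data_presto_hive_s3_presto_s3_file_system_read_retries_fifteen_minute_rate": "Moderate",
    "watsonx_data_presto_hive_s3_presto_s3_file_system_get_metadata_retries_five_minute_count": "Low",
    "watsonx_data_presto_task_executor_split_skipped_due_to_memory_pressure_total_count": "Low",
    "watsonx_data_presto_task_executor_processor_executor_shutdown": "Low",
}


def severity_of(metric_name):
    """Return the listed severity for a metric, or "Unclassified" if it is not listed."""
    return SEVERITY_BY_METRIC.get(metric_name, "Unclassified")


print(severity_of("watsonx_data_presto_cache_stats_quota_exceeded"))      # Severe
print(severity_of("watsonx_data_presto_query_manager_running_queries"))  # Unclassified
```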

Anomaly and trend insights

Highlights unexpected patterns or deviations in query behavior, helping detect performance degradation or improvements.

Presto (Java) engine:

  • Latency drift - Tracks evolving query latencies.

    • watsonx_data_presto_task_executor_split_wall_time_fifteen_minutes_avg
    • watsonx_data_presto_task_executor_leaf_split_wait_time_avg
    • watsonx_data_presto_task_executor_intermediate_split_wall_time_avg
  • Error rate vs baseline - Compares recent execution metrics to historical baselines.

    • watsonx_data_presto_task_executor_split_wall_time_fifteen_minutes_avg
    • watsonx_data_presto_task_executor_leaf_split_wait_time_avg
    • watsonx_data_presto_task_executor_intermediate_split_wall_time_avg
  • Throughput drop detector - Detects dips in data processing rates.

    • watsonx_data_presto_task_executor_global_scheduled_time_micros_five_minute_rate
    • watsonx_data_presto_hive_s3_presto_s3_file_system_successful_uploads_five_minute_rate
  • Query duration - Measures average execution time per task or split.

    • watsonx_data_presto_task_executor_leaf_split_cpu_time_avg
    • watsonx_data_presto_task_executor_intermediate_split_cpu_time_avg
  • Memory trend - Tracks memory usage and potential leaks.

    • jvm_memory_bytes_used
    • watsonx_data_presto_memory_heap_memory_usage_used_bytes
  • Workload trend comparison - Compares resource usage across time windows.

    • watsonx_data_presto_task_executor_global_cpu_time_micros_total_count
    • watsonx_data_presto_cluster_memory_manager_cluster_total_memory_reservation
    • watsonx_data_presto_task_executor_blocked_quanta_wall_time_fifteen_minutes_avg