This task tracks onboarding Analytics' service-level, Prometheus-based alerts (Hadoop, Hive, Druid, EventLogging, etc.) from Icinga to AlertManager.
These are effectively low-hanging fruit, since converting check_prometheus-based alerts into alerting rules is simple. The following checks are within scope:
__Master checks__
- hadoop-hdfs-capacity-remaining-percent
- hadoop-hdfs-corrupt-blocks
- hadoop-hdfs-missing-blocks
- hadoop-hdfs-total-files-heap
- hadoop-yarn-unhealthy-workers
- hadoop-hdfs-namenode-heap-usage
- hadoop-yarn-resourcemananager-heap-usage
__Worker checks__
- analytics_hadoop_hdfs_datanode
- analytics_hadoop_yarn_nodemanager
__Standby master checks__
- hadoop-hdfs-namenode-heap-usage
- hadoop-yarn-resourcemananager-heap-usage
__Hive checks__
- hive-metastore-heap-usage
- hive-server-heap-usage
__Druid checks__
- druid_netflow_supervisor
- druid_coordinator_segments_unavailable_analytics
- druid_coordinator_segments_unavailable_public
__Eventlogging checks__
- eventlogging_EventError_throughput
- eventlogging_NavigationTiming_throughput
- eventlogging_throughput
- eventlogging_processors_kafka_lag
- Each of: ${eventgate_service}_validation_error_rate
- Each of: eventgate_logging_external_latency_${site}
- Each of: eventgate_logging_external_errors_${site}
I am not certain whether the following are in scope, but they might be:
- Kafka checks
- Labstore checks
- Zookeeper checks
For more context, see the doc for AlertManager.
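As a rough illustration of why converting a check_prometheus-based alert into an alerting rule is simple, here is a minimal sketch. It assumes a check boils down to a PromQL query, a comparison, and warning/critical thresholds. The `PrometheusCheck` class, its field names, the `team`/`severity` labels, the `5m` hold time, and the example metric name are all hypothetical and do not reflect the actual Icinga or AlertManager configuration. Only the output layout (`groups` / `rules` / `alert` / `expr` / `for` / `labels` / `annotations`) follows the standard Prometheus rule file format.

```python
import json
from dataclasses import dataclass

# Hypothetical mapping from a check's comparison name to a PromQL operator.
_OPERATORS = {"ge": ">=", "gt": ">", "le": "<=", "lt": "<"}


@dataclass
class PrometheusCheck:
    """Hypothetical description of one threshold check on a PromQL query."""
    name: str
    query: str
    warning: float
    critical: float
    method: str = "ge"
    summary: str = ""


def to_alerting_rules(check, team="analytics", hold="5m"):
    """Build one rule per severity; the label names are assumptions."""
    op = _OPERATORS[check.method]
    rules = []
    for severity, threshold in (("warning", check.warning),
                                ("critical", check.critical)):
        rules.append({
            "alert": check.name,
            "expr": f"({check.query}) {op} {threshold:g}",
            "for": hold,
            "labels": {"team": team, "severity": severity},
            "annotations": {"summary": check.summary or check.name},
        })
    return rules


def render_group(group_name, rules):
    """Render rules as rule-file YAML text.

    Values are emitted with json.dumps, since a JSON string is also a
    valid YAML double-quoted scalar. This avoids a YAML dependency.
    """
    lines = ["groups:", f"  - name: {json.dumps(group_name)}", "    rules:"]
    for rule in rules:
        lines.append(f"      - alert: {json.dumps(rule['alert'])}")
        lines.append(f"        expr: {json.dumps(rule['expr'])}")
        lines.append(f"        for: {json.dumps(rule['for'])}")
        for section in ("labels", "annotations"):
            lines.append(f"        {section}:")
            for key, value in rule[section].items():
                lines.append(f"          {key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"


# Example input: the check name comes from the list above; the metric
# name and thresholds are placeholders, not the real values.
example = PrometheusCheck(
    name="hadoop-hdfs-corrupt-blocks",
    query="hypothetical_hdfs_corrupt_blocks",
    warning=2,
    critical=5,
    summary="HDFS corrupt blocks above threshold",
)
print(render_group("analytics_hadoop", to_alerting_rules(example)))
```

With this shape, the warning rule keeps firing while the critical one fires, so a real setup would likely need an inhibition rule or mutually exclusive expressions. The templated checks (`${eventgate_service}`, `${site}`) could be handled by building one `PrometheusCheck` per value in a loop.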